who will probably never read this book.


Preface 


This book is about extracting information from noisy data, making decisions that 
have uncertain consequences, and mitigating the potentially detrimental effects of 
uncertainty. 

Applications of those ideas are prevalent in computer science and electrical 
engineering: digital communication, GPS, self-driving cars, voice recognition, 
natural language processing, face recognition, computational biology, medical tests, 
radar systems, games of chance, investments, data science, machine learning, 
artificial intelligence, and countless (in a colloquial sense) others. 

This material is truly exciting and fun. I hope you will share my enthusiasm for 
the ideas. 


Berkeley, CA, USA Jean Walrand 
April 2020 
This book is about applications of probability in electrical engineering and computer
science. It is not a survey of all the important applications. That would be too
ambitious. Rather, the course describes real, important, and representative applications
that make use of a fairly wide range of probability concepts and techniques.

Probabilistic modeling and analysis are essential skills for computer scientists 
and electrical engineers. These skills are as important as calculus and discrete 
mathematics. The systems that these scientists and engineers use and/or design are 
complex and operate in an uncertain environment. Understanding and quantifying 
the impact of this uncertainty is critical to the design of systems. 

The book was written for the upper-division course EECS126 "Probability in 
EECS" in the Department of Electrical Engineering and Computer Sciences of the 
University of California, Berkeley. The students have taken an elementary course on 
probability. They know the concepts of event, probability, conditional probability, 
Bayes' rule, discrete random variables and their expectation. They also have some 
basic familiarity with matrix operations. The students in this class are smart,
hard-working, and interested in clever and sophisticated ideas. After taking this course,
the students are familiar with Markov chains, stochastic dynamic programming, 
detection, and estimation. They have both an intuitive understanding and a working 
knowledge of these concepts and their methods. Subsequently, many students go on 
to study artificial intelligence and machine learning. This course provides them with 
a background that enables them to go beyond blindly using toolboxes. 

In contrast to most introductory books on probability, the material is organized by 
applications. Instead of the usual sequence—probability space, random variables, 
expectation, detection, estimation, Markov chains—we start each topic with a 
concrete, real, and important EECS application. We introduce the theory as it is 
needed to study the applications. We believe that this approach makes the theory 
more relevant by demonstrating its usefulness as it is introduced. Moreover, the
book emphasizes hands-on projects in which the students use Python notebooks, available
from the book website, to simulate and calculate. Our colleagues at Berkeley
designed these projects carefully to reinforce the intuitive understanding of the 
concepts and to prepare the students for their own investigations. 

The chapters, except for the last one and the appendices, are divided into two 
parts: A and B. Parts A contain the key ideas that should be accessible to junior-level
students. Parts B contain more difficult aspects of the material. It is possible
to teach only the appendices and parts A. This would constitute a good junior-level 
course. One possible approach is to teach parts A in a first course and parts B in a 
second course. For a more ambitious course, one may teach parts A, then parts B. 
It is also possible to teach the chapters in order. The last chapter is a collection of 
more advanced topics that the reader and instructor can choose from. 

The appendices should be useful for most readers. Appendix A discusses the 
elementary notions of probability on simple examples. Students might benefit from 
a quick read of this chapter. 

Appendix B reviews the basic concepts of probability. Depending on the 
background of the students, it may be recommended to start the course with a review 
of that appendix. 

The theory starts with models of uncertain quantities. Let us denote such
quantities by X and Y. A model enables one to calculate the expected value E(h(X))
of a function h(X) of X. For instance, X might specify the output of a solar panel
every day during 1 month and h(X) the total energy that the panel produced. Then
E(h(X)) is the average energy that the panel produces per month. Other examples
are the average delay of packets in a communication network or the average time a
data center takes to complete one job (Fig. 1).


Fig. 1 Evaluation (X → E(h(X)))


Estimating E(h(X)) is called performance evaluation. In many cases, the system
that handles the uncertain quantities has some parameters $\theta$ that one can select to
tune its operations. For instance, the orientation of the solar panels can be adjusted.
Similarly, one may be able to tune the operations of a data center. One may model
the effect of the parameters by a function $h(X, \theta)$ that describes the measure of
performance in terms of the uncertain quantities X and the tuning parameters $\theta$
(Fig. 2).


Fig. 2 Optimization ($\max_\theta E(h(X, \theta))$)


One important problem is then to find the values of the parameters $\theta$ that
maximize $E(h(X, \theta))$. This is not a simple problem if one does not have an
analytical expression for this average value in terms of $\theta$. We explain such
optimization problems in the book.

There are many situations where one observes Y and one is interested in guessing 
the value of X, which is not observed. As an example, X may be the signal that a 
transmitter sends and Y the signal that the receiver gets (Fig. 3). 


Fig. 3 Inference (Y → X)


Fig. 4 Control


The problem of guessing X on the basis of Y is an inference problem. Examples 
include detection problems (Is there a fire? Do you have the flu?) and estimation 
problems (Where is the iPhone given the GPS signal?). Finally, there is a class of 
problems where one uses the observations to act upon a system that then changes. 
For instance, a self-driving car uses observations from laser range finders, GPS, and 
cameras to steer the car. These are control problems (Fig. 4). 

Thus, the course discusses performance evaluation, optimization, inference, and 
control problems. Some of these topics are called artificial intelligence in computer 
science and statistical signal processing in electrical engineering. Probabilists call 
them examples. Mathematicians may call them particular cases. The techniques 
used to address these topics are introduced by looking at concrete applications such 
as web search, multiplexing, digital communication, speech recognition, tracking, 
route planning, and recommendation systems. Along the way, we will meet some of 
the giants of the field. 

The website 


https://www.springer.com/us/book/9783030499945 


provides additional resources for this book, such as an Errata, Additional Problems, 
and Python Labs. 


About This Second Edition 


This second edition differs from the first in a few aspects. The Matlab exercises 
have been deleted as most students use Python. Python exercises are not included 
in the book; they can be found on the website. The appendix on Linear Algebra has 
been deleted. The relevant results from that theory are introduced in the text when 
needed. Appendix A is new. It is motivated by the realization that some students are 
confused by basic notions. The chapters on networks are new. They were requested 
by some colleagues. Basic statistics are discussed in Chap. 8. Neural networks are 
explained in Chap. 12. 
Application: Ranking the most relevant pages in web search
Topics: Finite Discrete Time Markov Chains, SLLN 


Background: 


* probability space (B.1.1); 

* conditional probability (B.1.5); 

* discrete random variable (B.2.1); 

* expectation and conditional expectation for discrete RVs (B.2.2), (B.3.5). 


1.1 Model 


The World Wide Web is a collection of linked web pages (Fig. 1.1). These pages
and their links form a graph. The nodes of the graph are the pages $\mathcal{X}$ and there is an
arc (a directed edge) from i to j if page i has a link to j.

Intuitively, a page has a high rank if other pages with a high rank point to it. (The
actual ordering of search engine results depends also on the presence of the search
keywords in the pages and on many other factors, in addition to the rank measure
that we discuss here.) Thus, the rank $\pi(i)$ of page i is a positive number and


$$\pi(i) = \sum_{j \in \mathcal{X}} \pi(j) P(j, i), \quad i \in \mathcal{X},$$

Fig. 1.1 Pages point to one another in the web. Here, P(A, B) = 1/2 and P(D, E) = 1/3


Fig. 1.2 Larry Page


Fig. 1.3 Balance equations?


where P(j, i) is the fraction of links in j that point to i and is zero if there is no 
such link. In our example, P(A, B) = 1/2, P(D, E) = 1/3, P(B, A) = 0, etc. 
(The basic idea of the algorithm is due to Larry Page (Fig. 1.2), hence the name 
PageRank. Since it ranks pages, the name is doubly appropriate.) 

We can write these equations in matrix notation as 


$$\pi = \pi P, \tag{1.1}$$


where we treat $\pi$ as a row vector with components $\pi(i)$ and P as a square matrix
with entries P(i, j) (Figs. 1.3, 1.4 and 1.5).

Equations (1.1) are called the balance equations. Note that if $\pi$ solves these
equations, then any multiple of $\pi$ also solves the equations. For convenience, we
normalize the solution so that the ranks of the pages add up to one, i.e.,


$$\sum_{i \in \mathcal{X}} \pi(i) = 1. \tag{1.2}$$


Fig. 1.4 Copy of Fig. 1.1. Recall that P(A, B) = 1/2 and P(D, E) = 1/3, etc.


Fig. 1.5 Andrey Markov. 
1856-1922 


For the example of Fig. 1.4, the balance equations are

$$\begin{aligned}
\pi(A) &= \pi(C) + \pi(D)(1/3) \\
\pi(B) &= \pi(A)(1/2) + \pi(D)(1/3) + \pi(E)(1/2) \\
\pi(C) &= \pi(B) + \pi(E)(1/2) \\
\pi(D) &= \pi(A)(1/2) \\
\pi(E) &= \pi(D)(1/3).
\end{aligned}$$


Solving these equations with the condition that the numbers add up to one yields


$$\pi = [\pi(A), \pi(B), \pi(C), \pi(D), \pi(E)] = \frac{1}{39}[12, 9, 10, 6, 2].$$

Thus, page A has the highest rank and page E has the smallest. A search engine that 
uses this method would combine these ranks with other factors to order the pages. 
Search engines also use variations on this measure of rank. 
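
As a numerical check of this example, the balance equations can be solved exactly together with the normalization (1.2). The following is a minimal Python sketch, not code from the book: the matrix `P` is read off the five balance equations above, and the helper names `solve` and `invariant_distribution` are hypothetical. It uses exact rational arithmetic so that the result can be compared with the fractions in the text.

```python
from fractions import Fraction as F

# Transition matrix implied by the balance equations above (states A..E).
STATES = ["A", "B", "C", "D", "E"]
P = [
    [0, F(1, 2), 0, F(1, 2), 0],        # A -> B, D
    [0, 0, 1, 0, 0],                    # B -> C
    [1, 0, 0, 0, 0],                    # C -> A
    [F(1, 3), F(1, 3), 0, 0, F(1, 3)],  # D -> A, B, E
    [0, F(1, 2), F(1, 2), 0, 0],        # E -> B, C
]


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (hypothetical helper)."""
    n = len(b)
    m = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n] for row in m]


def invariant_distribution(p):
    """Solve pi = pi P with the components of pi adding up to one."""
    n = len(p)
    # Row i encodes sum_j pi(j) P(j, i) - pi(i) = 0.
    a = [[F(p[j][i]) - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    b = [F(0)] * n
    # One balance equation is redundant; replace it by the normalization (1.2).
    a[-1] = [F(1)] * n
    b[-1] = F(1)
    return solve(a, b)


pi = invariant_distribution(P)
for name, value in zip(STATES, pi):
    print(name, value)
```

Replacing one balance equation by the normalization is a common way to get a square, nonsingular system when the chain is irreducible; other choices (for instance a least-squares solve) would work as well.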


1.2 Markov Chain 


Imagine that you are browsing the web. After viewing a page i, say for one unit 
of time, you go to another page by clicking one of the links on page i, chosen at 
random. In this process, you go from page i to page j with probability P(i, j) 
where P(i, j) is the same as we defined earlier. The resulting sequence of pages 
that you visit is called a Markov chain, a model due to Andrey Markov (Fig. 1.5).
1.2.1 General Definition


More generally, consider a finite graph with nodes $\mathcal{X} = \{1, 2, \ldots, N\}$ and directed
edges. In this graph, some edges can go from a node to itself. To each edge (i, j)
one assigns a positive number P(i, j) in a way that the sum of the numbers on the
edges out of each node is equal to one. By convention, P(i, j) = 0 if there is no
edge from i to j.

The corresponding matrix P = [P(i, j)] with nonnegative entries and rows that
add up to one is called a stochastic matrix. The sequence $\{X(n), n \geq 0\}$ that goes
from node i to node j with probability P(i, j), independently of the nodes it visited
before, is then called a Markov chain. The nodes are called the states of the Markov
chain and the P(i, j) are called the transition probabilities. We say that X(n) is the
state of the Markov chain at time n, for $n \geq 0$. Also, X(0) is called the initial state.
The graph is the state transition diagram of the Markov chain.

Figure 1.6 shows the state transition diagrams of three Markov chains. 

Thus, our description corresponds to the following property: 


$$P[X(n+1) = j \mid X(n) = i, X(m), m < n] = P(i, j), \quad \forall i, j \in \mathcal{X}, \ n \geq 0. \tag{1.3}$$


The probability of moving from i to j does not depend on the previous states. This 
"amnesia" is called the Markov property. It formalizes the fact that X (n) is indeed 
a "state" in that it contains all the information relevant for predicting the future of 
the process. 


1.2.2 Distribution After n Steps and Invariant Distribution 


If the Markov chain is in state j with probability $\pi_n(j)$ at step n for some $n \geq 0$, it
is in state i at step n + 1 with probability $\pi_{n+1}(i)$ where


$$\pi_{n+1}(i) = \sum_{j \in \mathcal{X}} \pi_n(j) P(j, i), \quad i \in \mathcal{X}. \tag{1.4}$$


Indeed, the event that the Markov chain is in state i at step n + 1 is the union over
all j of the disjoint events that it is in state j at step n and in state i at step n + 1.
The probability of a disjoint union of events is the sum of the probabilities of the
individual events. Also, the probability that the Markov chain is in state j at step n
and in state i at step n + 1 is $\pi_n(j) P(j, i)$.

Thus, in matrix notation, 


$$\pi_{n+1} = \pi_n P,$$

so that

$$\pi_n = \pi_0 P^n, \quad n \geq 0. \tag{1.5}$$


Observe that $\pi_n(i) = \pi_0(i)$ for all $n \geq 0$ and all $i \in \mathcal{X}$ if and only if $\pi_0$
solves the balance equations (1.1). In that case, we say that $\pi_0$ is an invariant
distribution. Thus, an invariant distribution is a nonnegative solution $\pi$ of (1.1)
whose components sum to one.
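
The recursion (1.4) is easy to try numerically. The sketch below is only an illustration under stated assumptions: it reuses the five-page web graph of Sect. 1.1 (states A to E stored as indices 0 to 4), starts on page A (an arbitrary choice), and uses hypothetical helper names `step` and `distribution_after`. It also checks that the vector (1/39)[12, 9, 10, 6, 2] is left unchanged by one step, up to floating-point rounding.

```python
# Web graph of Sect. 1.1, rows and columns ordered A, B, C, D, E.
P = [
    [0, 1 / 2, 0, 1 / 2, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1 / 3, 1 / 3, 0, 0, 1 / 3],
    [0, 1 / 2, 1 / 2, 0, 0],
]


def step(pi_n, p):
    """One application of (1.4): pi_{n+1}(i) = sum_j pi_n(j) P(j, i)."""
    n = len(p)
    return [sum(pi_n[j] * p[j][i] for j in range(n)) for i in range(n)]


def distribution_after(pi_0, p, n_steps):
    """Distribution after n_steps, i.e. pi_0 P^n as in (1.5)."""
    pi_n = list(pi_0)
    for _ in range(n_steps):
        pi_n = step(pi_n, p)
    return pi_n


pi_0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # start on page A (arbitrary choice)
for n in (0, 1, 2, 10, 100):
    print(n, [round(x, 4) for x in distribution_after(pi_0, P, n)])

# The invariant distribution found in Sect. 1.1 is a fixed point of step().
target = [12 / 39, 9 / 39, 10 / 39, 6 / 39, 2 / 39]
print([round(x, 4) for x in step(target, P)])
```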

1.3 Analysis 

Natural questions are 

* Does there exist an invariant distribution? 

* Is it unique?

* Does $\pi_n$ approach an invariant distribution?


The next sections answer those questions. 


1.3.1 Irreducibility and Aperiodicity 

We need the following definitions. 

Definition 1.1 (Irreducible, Aperiodic, Period) 

(a) A Markov chain is irreducible, if it can go from any state to any other state, 
possibly after many steps. 

(b) Assume the Markov chain is irreducible and let 


$$d(i) := \text{g.c.d.}\{n \geq 1 \mid P^n(i, i) > 0\}. \tag{1.6}$$


(If S is a set of positive integers, g.c.d.(S) is the greatest common divisor of
these integers.)

Then d(i) has the same value d for all i, as shown in Lemma 2.2. The Markov
chain is aperiodic if d = 1. Otherwise, it is periodic with period d.


The Markov chains (a) and (b) in Fig. 1.6 are irreducible and (c) is not. Also, (a) 
is periodic and (b) is aperiodic. 
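
Definition (1.6) can be explored numerically by tracking which states are reachable in exactly n steps. The sketch below is an illustration only. The transition probabilities of Fig. 1.6 are not reproduced here, so it uses three small chains chosen for illustration: the web graph of Sect. 1.1, a two-state alternating chain, and a three-state cycle. The function name `period` and the cutoff `max_n` are assumptions; the cutoff is a heuristic that is ample for these examples, not a proven bound.

```python
from math import gcd


def period(p, i, max_n=None):
    """g.c.d. of {n >= 1 : P^n(i, i) > 0}, scanning n = 1, ..., max_n."""
    size = len(p)
    if max_n is None:
        max_n = 2 * size * size  # illustrative cutoff (an assumption)
    # reach[k] is True if state k can be reached from i in exactly n steps.
    reach = [k == i for k in range(size)]
    d = 0
    for n in range(1, max_n + 1):
        reach = [any(reach[j] and p[j][k] > 0 for j in range(size))
                 for k in range(size)]
        if reach[i]:
            d = gcd(d, n)  # gcd(0, n) == n, so the first return time seeds d
    return d


WEB = [
    [0, 1 / 2, 0, 1 / 2, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1 / 3, 1 / 3, 0, 0, 1 / 3],
    [0, 1 / 2, 1 / 2, 0, 0],
]
ALTERNATING = [[0, 1], [1, 0]]               # 0 -> 1 -> 0 -> ...
CYCLE3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # 0 -> 1 -> 2 -> 0

print([period(WEB, i) for i in range(5)])
print([period(ALTERNATING, i) for i in range(2)])
print([period(CYCLE3, i) for i in range(3)])
```

In the web graph, page A can return to itself in 2 steps (A, D, A) and in 3 steps (A, B, C, A), so d(A) = 1 and that chain is aperiodic.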
Fig. 1.6 Three Markov chains with three states {1, 2, 3} and different transition probabilities. (a) is irreducible, periodic; (b) is irreducible, aperiodic; (c) is not irreducible
1.3.2 Big Theorem


Simple examples show that the answers to Q2 and Q3 can be negative. For instance,
every distribution is invariant for a Markov chain that does not move. Also, a Markov
chain that alternates between the states 0 and 1 with $\pi_0(0) = 1$ is such that $\pi_n(0) = 1$
when n is even and $\pi_n(0) = 0$ when n is odd, so that $\pi_n$ does not converge.
However, we have the following key result.


Theorem 1.1 (Big Theorem for Finite Markov Chains) 


(a) If the Markov chain is finite and irreducible, it has a unique invariant distribution
$\pi$ and $\pi(i)$ is the long-term fraction of time that X(n) is equal to i.

(b) If the Markov chain is also aperiodic, then the distribution $\pi_n$ of X(n) converges
to $\pi$.


In this theorem, the long-term fraction of time that X (n) is equal to i is defined 
as the limit 


$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} 1\{X(n) = i\}.$$


In this expression, 1{X (n) = i} takes the value 1 if X(n) = i and the value 0 
otherwise. Thus, in the expression above, the sum is the total time that the Markov 
chain is in state i during the first N steps. Dividing by N gives the fraction of time. 
Taking the limit yields the long-term fraction of time. 

The theorem says that, if the Markov chain is irreducible, this limit exists and is 
equal to $\pi(i)$. In particular, this limit does not depend on the particular realization
of the random variables. This means that every simulation yields the same limit, as 
you will verify in Problem 1.8. 
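
A small simulation makes this statement concrete. The sketch below is not the simulator requested in Problem 1.8 and makes several assumptions: it uses the web graph of Sect. 1.1 (states A to E as indices 0 to 4), a finite run of 100,000 steps in place of the limit, and hypothetical function names `simulate` and `fraction_of_time`. Two runs with different seeds should give fractions of time close to each other and to (1/39)[12, 9, 10, 6, 2].

```python
import random

P = [
    [0, 1 / 2, 0, 1 / 2, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1 / 3, 1 / 3, 0, 0, 1 / 3],
    [0, 1 / 2, 1 / 2, 0, 0],
]
PI = [12 / 39, 9 / 39, 10 / 39, 6 / 39, 2 / 39]  # invariant distribution, Sect. 1.1


def simulate(p, x0, n_steps, rng):
    """Return a path X(0), ..., X(n_steps - 1) of the chain with matrix p."""
    states = range(len(p))
    path = [x0]
    for _ in range(n_steps - 1):
        # Next state drawn according to the row of the current state.
        path.append(rng.choices(states, weights=p[path[-1]])[0])
    return path


def fraction_of_time(path, n_states):
    """Fraction of the steps of the path spent in each state."""
    return [path.count(i) / len(path) for i in range(n_states)]


for seed in (1, 2):
    path = simulate(P, x0=0, n_steps=100_000, rng=random.Random(seed))
    print(seed, [round(f, 3) for f in fraction_of_time(path, len(P))])
print("pi", [round(x, 3) for x in PI])
```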
1.3.3 Long-Term Fraction of Time


Why should the fraction of time that a Markov chain spends in one state converge? 
In our browsing example, if we count the time that we spend on page A over n time 
units and we divide that time by n, it turns out that the ratio converges to $\pi(A)$.

This result is similar to the fact that, when we flip a fair coin repeatedly, the 
fraction of “heads” converges to 50%. Thus, even though the coin has no memory, 
it makes sure that the fraction of heads approaches 50%. How does it do it? 

These convergence results are examples of the Law of Large Numbers. This 
law is at the core of our intuitive understanding of probability and it captures our 
notion of statistical regularity. Even though outcomes are uncertain, one can make 
predictions. Here is a statement of the result. We discuss it in Chap. 2. 


Theorem 1.2 (Strong Law of Large Numbers) Let $\{X(n), n \geq 1\}$ be a sequence
of i.i.d. random variables with mean $\mu$. Then


$$\frac{X(1) + \cdots + X(n)}{n} \to \mu \quad \text{as } n \to \infty, \text{ with probability 1.}$$


Thus, the sample mean values $Y(n) := (X(1) + \cdots + X(n))/n$ converge to the
expected value, with probability 1. (See Fig. 1.7.) Note that the sample mean values
Y(n) are random variables: for each n, the value of Y(n) depends on the particular
realization of the random variables X(m); if you repeat the experiment, the values
will probably be different. However, the limit is always $\mu$, with probability 1. We
say that the convergence is almost sure.¹
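
The behavior of the sample mean for a balanced die can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `sample_means` and the seeds are arbitrary, and a finite number of rolls can only suggest the limit, not prove it.

```python
import random


def sample_means(n_rolls, rng):
    """Running averages Y(1), ..., Y(n_rolls) of rolls of a balanced die."""
    total = 0
    means = []
    for n in range(1, n_rolls + 1):
        total += rng.randint(1, 6)  # X(n), uniform on {1, ..., 6}
        means.append(total / n)
    return means


# Different seeds give different realizations, yet all runs approach 3.5.
for seed in (0, 1, 2):
    means = sample_means(100_000, random.Random(seed))
    print(seed, [round(means[n - 1], 3) for n in (10, 100, 1_000, 100_000)])
```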


Fig. 1.7 When rolling a balanced die, the sample mean converges to 3.5

¹ “Almost sure” is a somewhat confusing technical expression. It means that, although there are
outcomes for which the convergence does not happen, all these outcomes have probability zero.
For instance, if you flip a fair coin, the outcome where the coin flips keep on yielding tails
is such that the fraction of tails does not converge to 0.5. The same is true for the outcome
H, H, T, H, H, T, .... So, almost sure means that it happens with probability 1: the
outcomes for which it fails form a set that has probability zero.
1.4 Illustrations


We illustrate Theorem 1.1 for the Markov chains in Fig. 1.6. The three situations are 
different and quite representative. We explore them one by one. 

Figures 1.8, 1.9 and 1.10 correspond to each of the three Markov chains in
Fig. 1.6, as shown on top of each figure. The top graph of each figure shows
the successive values of $X_n$ for n = 0, 1, ..., 100. The middle graph of the
figure shows, for n = 0, ..., 100, the fraction of time that $X_m$ is equal to the
different states during {0, 1, ..., n}. The bottom graph of the figure shows, for
n = 0, ..., 100, the probability that $X_n$ is equal to each of the states.

In Fig. 1.8, the fraction of time that the Markov chain is equal to each of the
states {1, 2, 3} converges to positive values. This is the case because the Markov
chain is irreducible. (See Theorem 1.1(a).) However, the probability of being in a 
given state does not converge. This is because the Markov chain is periodic. (See 
Theorem 1.1(b).) 

For the Markov chain in Fig. 1.9, the probabilities converge, because the Markov 
chain is aperiodic. (See again Theorem 1.1.) 

Finally, for the Markov chain in Fig. 1.10, eventually $X_n = 3$; the fraction of
time in state 3 converges to one and so does the probability of being in state 3. What 
happens in this case is that state 3 is absorbing: once the Markov chain gets there, 
it cannot leave. 


Fig. 1.8 Markov chain (a) in Fig. 1.6


Fig. 1.9 Markov chain (b) in Fig. 1.6


Fig. 1.10 Markov chain (c) in Fig. 1.6
1.5 Hitting Time


Say that you start in page A in Fig. 1.1 and that, at every step, you follow each
outgoing link of the page where you are with equal probabilities. How many steps
does it take to reach page E? This time is called the hitting time, or first passage
time, of page E and we designate it by $T_E$. As we can see from the figure, $T_E$ can
be as small as 2, but it has a good chance of being much larger than 2 (Fig. 1.11).


1.5.1 Mean Hitting Time


Our goal is to calculate the average value of $T_E$ starting from $X_0 = A$. That is, we
want to calculate


$$\beta(A) := E[T_E \mid X_0 = A].$$


The key idea to perform this calculation is to in fact calculate the mean hitting time
for all possible initial pages. That is, we will calculate $\beta(i)$ for i = A, B, C, D, E
where


$$\beta(i) = E[T_E \mid X_0 = i].$$


The reason for considering these different values is that the mean time to hit E 
starting from A is clearly related to the mean hitting time starting from B and 
from D. These in turn are related to the mean hitting time starting from C. We 
claim that 


$$\beta(A) = 1 + \frac{1}{2}\beta(B) + \frac{1}{2}\beta(D). \tag{1.7}$$


To see this, note that, starting from A, after one step, the Markov chain is in state B
with probability 1/2 and it is in state D with probability 1/2. Thus, after one step,
the average time to hit E is the average time starting from B, with probability 1/2,
and it is the average time starting from D, with probability 1/2.


Fig. 1.11 This is NOT what 
we mean by hitting time! 
This situation is similar to the following one. You flip a fair coin. If the outcome
is heads you get a random amount of money equal to X and if it is tails you get a
random amount Y. On average, you get


$$\frac{1}{2}E(X) + \frac{1}{2}E(Y).$$

Similarly, we can see that 


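$$\begin{aligned}
\beta(B) &= 1 + \beta(C) \\
\beta(C) &= 1 + \beta(A) \\
\beta(D) &= 1 + \frac{1}{3}\beta(A) + \frac{1}{3}\beta(B) + \frac{1}{3}\beta(E) \\
\beta(E) &= 0.
\end{aligned}$$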


These equations, together with (1.7), are called the first step equations (FSE). 
Solving them, we find 


$$\beta(A) = 17, \ \beta(B) = 19, \ \beta(C) = 18, \ \beta(D) = 13 \ \text{and} \ \beta(E) = 0.$$
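
The first step equations for the mean hitting time form a linear system, so they can be solved mechanically. The following Python sketch is an illustration, not the book's code: the matrix `P` encodes the web graph of Sect. 1.1 (states A to E as indices 0 to 4), and the names `solve` and `mean_hitting_times` are hypothetical. Exact fractions are used so that the output can be compared with the values above.

```python
from fractions import Fraction as F

P = [
    [0, F(1, 2), 0, F(1, 2), 0],        # A -> B, D
    [0, 0, 1, 0, 0],                    # B -> C
    [1, 0, 0, 0, 0],                    # C -> A
    [F(1, 3), F(1, 3), 0, 0, F(1, 3)],  # D -> A, B, E
    [0, F(1, 2), F(1, 2), 0, 0],        # E -> B, C
]


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (hypothetical helper)."""
    n = len(b)
    m = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n] for row in m]


def mean_hitting_times(p, target):
    """beta(i) = 1 + sum_j P(i, j) beta(j) outside target, beta(i) = 0 on target."""
    free = [i for i in range(len(p)) if i not in target]
    # Unknowns are beta(i) for i outside the target; beta is 0 on the target.
    a = [[(1 if i == j else 0) - F(p[i][j]) for j in free] for i in free]
    b = [F(1)] * len(free)
    beta = [F(0)] * len(p)
    for i, value in zip(free, solve(a, b)):
        beta[i] = value
    return beta


beta = mean_hitting_times(P, target={4})  # index 4 is page E
print(dict(zip("ABCDE", beta)))
```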


1.5.2 Probability of Hitting a State Before Another 


Consider once again the same situation but say that we are interested in the probability
that starting from A we visit state C before E. We write this probability as


$$\alpha(A) = P[T_C < T_E \mid X_0 = A].$$


As in the previous case, it turns out that we need to calculate $\alpha(i)$ for i =
A, B, C, D, E. We claim that


$$\alpha(A) = \frac{1}{2}\alpha(B) + \frac{1}{2}\alpha(D). \tag{1.8}$$


To see this, note that, starting from A, after one step you are in state B with
probability 1/2 and you will then visit C before E with probability $\alpha(B)$. Also,
with probability 1/2, you will be in state D after one step and you will then visit C
before E with probability $\alpha(D)$. Thus, the event that you visit C before E starting
from A is the union of two disjoint events: either you do that by first going to B or
by first going to D. Adding the probabilities of these two events, we get (1.8).
Similarly, one finds that


$$\begin{aligned}
\alpha(B) &= \alpha(C) \\
\alpha(C) &= 1 \\
\alpha(D) &= \frac{1}{3}\alpha(A) + \frac{1}{3}\alpha(B) + \frac{1}{3}\alpha(E) \\
\alpha(E) &= 0.
\end{aligned}$$


These equations, together with (1.8), are also called the first step equations. Solving 
them, we find 


$$\alpha(A) = \frac{4}{5}, \ \alpha(B) = 1, \ \alpha(C) = 1, \ \alpha(D) = \frac{3}{5}, \ \alpha(E) = 0.$$
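
These first step equations can be checked the same way as those for the mean hitting time. The sketch below is an illustration under the same assumptions as before: `P` is the web graph of Sect. 1.1 with states A to E as indices 0 to 4, and `solve` and `prob_hit_before` are hypothetical helper names. The unknowns are the values of the probability for the states outside both sets.

```python
from fractions import Fraction as F

P = [
    [0, F(1, 2), 0, F(1, 2), 0],        # A -> B, D
    [0, 0, 1, 0, 0],                    # B -> C
    [1, 0, 0, 0, 0],                    # C -> A
    [F(1, 3), F(1, 3), 0, 0, F(1, 3)],  # D -> A, B, E
    [0, F(1, 2), F(1, 2), 0, 0],        # E -> B, C
]


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (hypothetical helper)."""
    n = len(b)
    m = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n] for row in m]


def prob_hit_before(p, a_set, b_set):
    """Probability of hitting a_set before b_set, from every initial state."""
    free = [i for i in range(len(p)) if i not in a_set and i not in b_set]
    # For free i: alpha(i) - sum_{j free} P(i, j) alpha(j) = sum_{j in a_set} P(i, j).
    a = [[(1 if i == j else 0) - F(p[i][j]) for j in free] for i in free]
    b = [sum((F(p[i][j]) for j in a_set), F(0)) for i in free]
    alpha = [F(1) if i in a_set else F(0) for i in range(len(p))]
    for i, value in zip(free, solve(a, b)):
        alpha[i] = value
    return alpha


alpha = prob_hit_before(P, a_set={2}, b_set={4})  # C before E
print(dict(zip("ABCDE", alpha)))
```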


1.5.3 FSE for Markov Chain 


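Let us generalize this example to the case of a Markov chain on $\mathcal{X} = \{1, 2, \ldots, N\}$
with transition probability matrix P. Let $T_i$ be the hitting time of state i. For a set
$A \subset \mathcal{X}$ of states, let $T_A = \min\{n \geq 0 \mid X(n) \in A\}$ be the hitting time of the set A.

First, we consider the mean value of $T_A$. Let


$$\beta(i) = E[T_A \mid X_0 = i], \quad i \in \mathcal{X}.$$

The FSE are


$$\beta(i) = \begin{cases} 1 + \sum_{j} P(i, j)\beta(j), & \text{if } i \notin A \\ 0, & \text{if } i \in A. \end{cases}$$


Second, we study the probability of hitting a set A before a set B, where $A, B \subset \mathcal{X}$
and $A \cap B = \emptyset$. Let


$$\alpha(i) = P[T_A < T_B \mid X_0 = i], \quad i \in \mathcal{X}.$$

The FSE are

$$\alpha(i) = \begin{cases} \sum_{j} P(i, j)\alpha(j), & \text{if } i \notin A \cup B \\ 1, & \text{if } i \in A \\ 0, & \text{if } i \in B. \end{cases}$$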
Third, we explore the value of


$$Y = \sum_{n=0}^{T_A} h(X(n)).$$


That is, you collect an amount h(i) every time you visit state i, until you enter set
A. Let


$$\gamma(i) = E[Y \mid X_0 = i], \quad i \in \mathcal{X}.$$

The FSE are


$$\gamma(i) = \begin{cases} h(i) + \sum_{j} P(i, j)\gamma(j), & \text{if } i \notin A \\ h(i), & \text{if } i \in A. \end{cases} \tag{1.9}$$


Fourth, we consider the value of

$$Z = \sum_{n=0}^{T_A} \beta^n h(X(n)),$$

where $\beta$ can be thought of as a discount factor. Let

$$\delta(i) = E[Z \mid X_0 = i].$$

The FSE are


$$\delta(i) = \begin{cases} h(i) + \beta \sum_{j} P(i, j)\delta(j), & \text{if } i \notin A \\ h(i), & \text{if } i \in A. \end{cases}$$

Hopefully these examples give you a sense of the variety of questions that can be 
answered for finite Markov chains. This is very fortunate, because Markov chains 
can be used to model a broad range of engineering and natural systems. 
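
The third and fourth families of first step equations can also be solved by simply iterating them until the values stop changing. The sketch below is an illustration under stated assumptions, not the book's method: it uses the web graph of Sect. 1.1 with A = {E}, a fixed number of iterations instead of a convergence test, a hypothetical function name `expected_total`, and a made-up reward vector for the discounted case. With h(i) = 1 for all i, the sum Y equals the hitting time of E plus one, so the undiscounted values should be one more than the mean hitting times found in Sect. 1.5.1.

```python
# Web graph of Sect. 1.1, states A..E stored as indices 0..4.
P = [
    [0, 1 / 2, 0, 1 / 2, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1 / 3, 1 / 3, 0, 0, 1 / 3],
    [0, 1 / 2, 1 / 2, 0, 0],
]


def expected_total(p, h, target, discount=1.0, n_iter=2000):
    """Iterate the FSE: value(i) = h(i) + discount * sum_j P(i, j) value(j)
    outside target, and value(i) = h(i) on target."""
    size = len(p)
    value = [0.0] * size
    for _ in range(n_iter):
        # The right-hand side uses the values of the previous iteration.
        value = [
            h[i] if i in target
            else h[i] + discount * sum(p[i][j] * value[j] for j in range(size))
            for i in range(size)
        ]
    return value


ones = [1.0] * 5
gamma = expected_total(P, ones, target={4})          # undiscounted, h = 1
rewards = [1.0, 0.0, 0.0, 0.0, 5.0]                  # hypothetical h, illustration only
delta = expected_total(P, rewards, target={4}, discount=0.9)
print([round(v, 4) for v in gamma])
print([round(v, 4) for v in delta])
```

Iteration is chosen here only because it mirrors the equations directly; for larger chains one would more likely solve the linear system, as in the earlier hitting-time examples.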


1.6 Summary 


* Markov Chains: states, transition probabilities, irreducible, aperiodic, invariant
distribution, hitting times;

* Strong Law of Large Numbers; 

* Big Theorem: irreducible implies unique invariant distribution equal to the 
long-term fraction of time in the states; convergence to invariant distribution 
if irreducible and aperiodic; 

* Hitting Times: first step equations. 
1.6.1 Key Equations and Formulas


Definition of MC        $P[X(n+1) = j \mid X(n) = i, X(m), m < n] = P(i, j)$    (1.3)
P.m.f. of $X_n$         $\pi_n = \pi_0 P^n$                                     (1.5)
Balance Equations       $\pi P = \pi$                                           (1.1)
First Step Equations    $\gamma(i) = h(i) + \sum_j P(i, j)\gamma(j)$            (1.9)


1.7 References 


There are many excellent books on Markov chains. Some of my favorites are 
Grimmett and Stirzaker (2001) and Bertsekas and Tsitsiklis (2008). The original 
patent on PageRank is Page (2001). The online book Easley and Kleinberg (2012) 
is an inspiring discussion of social networks. Chapter 14 of that reference discusses 
PageRank. 


1.8 Problems 


Problem 1.1 Construct a Markov chain that is not irreducible but whose distribution
converges to its unique invariant distribution.


Problem 1.2 Show a Markov chain whose distribution converges to a limit that 
depends on the initial distribution. 


Problem 1.3 Can you find a finite irreducible aperiodic Markov chain whose 
distribution does not converge? 


Problem 1.4 Show a finite irreducible aperiodic Markov chain that converges very 
slowly to its invariant distribution. 


Problem 1.5 Show that a function Y (n) = g(X (n)) of a Markov chain X (n) may 
not be a Markov chain. 


Problem 1.6 Construct a Markov chain that is a sequence of i.i.d. random variables.
Is it irreducible and aperiodic? 
Fig. 1.12 Markov chain for Problem 1.7

Problem 1.7 Consider the Markov chain X(n) with the state diagram shown in 
Fig. 1.12 where a, b € (0, 1). 


(a) Show that this Markov chain is aperiodic; 

(b) Calculate P[X(1) = 1, X(2) = 0, X(3) = 0, X(4) = 1 | X(0) = 0];
(c) Calculate the invariant distribution; 

(d) Let $T_i = \min\{n \geq 0 \mid X(n) = i\}$. Calculate $E[T_2 \mid X(0) = 1]$.


Problem 1.8 Use Python to write a simulator for a Markov chain $\{X(n), n \geq 1\}$
with K states, initial distribution $\pi$, and transition probability matrix P. The
program should be able to do the following:


1. Plot $\{X(n), n = 1, \ldots, N\}$;
2. Plot the fraction of time that X(n) is in some chosen states during $\{1, 2, \ldots, m\}$
as a function of m, for m = 1, ..., N;
3. Plot the probability that X(n) is equal to some chosen states, for n = 1, ..., N;
4. Use this program to simulate a periodic Markov chain with five states;
5. Use the program to simulate an aperiodic Markov chain with five states.

Problem 1.9 Use your simulator to simulate the Markov chains of Figs. 1.2 and 1.6. 
Problem 1.10 Find the invariant distribution for the Markov chains of Fig. 1.6. 


Problem 1.11 Calculate d(1), d(2), and d(3), defined in (1.6), for the Markov 
chains of Fig. 1.6. 


Problem 1.12 Calculate d (A), defined in (1.6), for the Markov chain of Fig. 1.2. 


Problem 1.13 Let $\{X_n, n \geq 0\}$ be a finite Markov chain. Assume that it has
a unique invariant distribution $\pi$ and that $\pi_n$ converges to $\pi$ for every initial
distribution $\pi_0$. Then (choose the correct answers, if any)


* $X_n$ is irreducible;

* $X_n$ is periodic;

* $X_n$ is aperiodic;

* $X_n$ might not be irreducible.
Problem 1.14 Consider the Markov chain $\{X_n, n \geq 0\}$ on {0, 1} with P(0, 1) =
0.1 and P(1, 0) = 0.3. Then (choose the correct answers, if any)


* The invariant distribution of the Markov chain is [0.75, 0.25];
* Let $T_1 = \min\{n \geq 0 \mid X_n = 1\}$. Then $E[T_1 \mid X_0 = 0] = 1.2$;
* $E[X_1 + X_2 \mid X_0 = 0] = 0.8$.


Problem 1.15 Consider the MC with the state transition diagram shown in 
Fig. 1.13. 


(a) What is the period of this MC? Explain. 

(b) Find all the invariant distributions for this MC. 

(c) Does $\pi_n$, the distribution of $X_n$, converge as $n \to \infty$? Explain.

(d) Do the fractions of time the MC spends in the states converge? If so, what is the 
limit? 


Problem 1.16 Consider the MC with the state transition diagram shown in 
Fig. 1.14. 


(a) Find all the invariant distributions of this MC. 
(b) Assume $\pi_0(3) = 1$. Find $\lim_n \pi_n$.


Problem 1.17 Consider the MC with the state transition diagram shown in 
Fig. 1.15. 


(a) Find all the invariant distributions of this MC. 
(b) Does $\pi_n$ converge as $n \to \infty$? If it does, prove it.
(c) Do the fractions of time the MC spends in the states converge? Prove it. 


Fig. 1.13 MC for Problem 1.15

Fig. 1.14 MC for Problem 1.16

Fig. 1.15 MC for Problem 1.17

Fig. 1.16 MC for Problem 1.18

Fig. 1.17 MC for Problem 1.19


Problem 1.18 Consider the MC shown in Fig. 1.16. 


(a) Find the invariant distribution $\pi$ of this Markov chain.

(b) Calculate the expected time from 0 to 2. 

(c) Use Python to plot the probability that, starting from 0, the MC has not reached 
2 after n steps. 

(d) Use Python to simulate the MC and plot the fraction of time that it spends in the 
different states after n steps. 

(e) Use Python to plot $\pi_n$.


Problem 1.19 For the Markov chain $\{X_n, n \geq 0\}$ with transition diagram shown in
Fig. 1.17, assume that $X_0 = 0$. Find the probability that $X_n$ hits 2 before it hits 1
twice.


Problem 1.20 Draw an irreducible aperiodic MC with six states and choose the 
transition probabilities. Simulate the MC in Python. Plot the fraction of time in the 
six states. Assume you start in state 1. Plot the probability of being in each of the 
six states. 


Problem 1.21 Repeat Problem 1.20, but with a periodic MC. 


Problem 1.22 How would you trick the PageRank algorithm into believing that 
your home page should be given a high rank? 


Hint Try adding another page with suitable links. 
Problem 1.23 Show that the holding time of a state is geometrically distributed.


Problem 1.24 You roll a die until the sum of the last two rolls is exactly 10. How 
many times do you have to roll, on average? 


Problem 1.25 You roll a die until the sum of the last three rolls is at least 15. How 
many times do you have to roll, on average? 


Problem 1.26 A doubly stochastic matrix is a nonnegative matrix whose rows and 
columns add up to one. Show that the invariant distribution is uniform for such a 
transition matrix. 


Problem 1.27 Assume that the Markov chain (c) of Fig.1.6 starts in state 1. 
Calculate the average number of times it visits state 1 before being absorbed in 
state 3. 


Problem 1.28 A man tries to go up a ladder that has N rungs. Every step he makes, 
he has a probability p of dropping back to the ground and he goes up one rung 
otherwise. Use the first step equations to calculate analytically the average time he 
takes to reach the top, for N = 1,...,20 and p = 0.05, 0.1, and 0.2. Use Python 
to plot the corresponding graphs. 


Problem 1.29 Let $\{X_n, n \ge 0\}$ be a finite irreducible Markov chain with transition
probability matrix $P$ and invariant distribution $\pi$. Show that, for all $i, j$,

$$\frac{1}{N} \sum_{n=0}^{N-1} 1\{X_n = i, X_{n+1} = j\} \to \pi(i) P(i, j), \text{ w.p. 1 as } N \to \infty.$$


Problem 1.30 Show that a Markov chain $\{X_n, n \ge 0\}$ can be written as

$$X_{n+1} = f(X_n, V_n), \quad n \ge 0,$$

where the $V_n$ are i.i.d. random variables independent of $X_0$.


Problem 1.31 Let $P$ and $\tilde{P}$ be two stochastic matrices and $\pi$ a pmf on the finite set
$\mathcal{X}$. Assume that

$$\pi(i) P(i, j) = \pi(j) \tilde{P}(j, i), \quad \forall i, j \in \mathcal{X}.$$

Show that $\pi$ is invariant for $P$.


Problem 1.32 Let $X_n$ be a Markov chain on a finite set $\mathcal{X}$. Assume that the
transition diagram of the Markov chain is a tree, as shown in Fig. 1.18. Show that if
$\pi$ is invariant and if $P$ is the transition matrix, then it satisfies the following detailed
balance equations:

$$\pi(i) P(i, j) = \pi(j) P(j, i), \quad \forall i, j.$$

Fig. 1.18 A transition diagram that is a tree


Problem 1.33 Let $X_n$ be a Markov chain such that $X_0$ has the invariant distribution
$\pi$ and the detailed balance equations are satisfied. Show that

$$P(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = P(X_N = x_0, X_{N-1} = x_1, \ldots, X_{N-n} = x_n)$$

for all $n$, all $N \ge n$, and all $x_0, \ldots, x_n$. Thus, the evolution of the Markov chain in
reverse time $(N, N-1, N-2, \ldots, N-n)$ cannot be distinguished from its evolution
in forward time $(0, 1, \ldots, n)$. One says that the Markov chain is time-reversible.


Problem 1.34 Let $\{X_n, n \ge 0\}$ be a Markov chain on $\{-1, 1\}$ with $P(-1, 1) =
P(1, -1) = a$ for a given $a \in (0, 1)$. Define

$$Y_n = X_0 + \cdots + X_n, \quad n \ge 0.$$

(a) Is $\{Y_n, n \ge 0\}$ a Markov chain? Prove or disprove.
(b) How would you calculate

$$E[\tau \mid Y_0 = 1] \text{ where } \tau = \min\{n > 0 \mid Y_n = -5 \text{ or } Y_n = 30\}?$$


Problem 1.35 You flip a fair coin repeatedly, forever. Show that the probability that 
the number of heads is always ahead of the number of tails is zero. 


Background:


* Borel-Cantelli (B.1.2); 

* monotonicity of expectation (B.2); 

* convergence of expectation (B.8)-(B.9); 

* properties of variance: (B.3) and Theorem B.4. 


2.1 Sample Space 


Let us connect the definition of a Markov chain $X = \{X_n, n \ge 0\}$ with the general
framework of Sect. B.1. (We write $X_n$ or $X(n)$.) In that section, we explained
that a random experiment is described by a sample space. The elements of the 
sample space are the possible outcomes of the experiment. A probability is defined 
on subsets, called events, of that sample space. Random variables are real-valued 
functions of the outcome of the experiment. 

To clarify these concepts, consider the case where the $X_n$ are i.i.d. Bernoulli
random variables with $P(X_n = 1) = P(X_n = 0) = 0.5$. These random variables
describe flips of a fair coin. The random experiment is to flip the coin repeatedly,
forever. Thus, one possible outcome of this experiment is an infinite sequence of
0's and 1's. Note that an outcome is not 0 or 1: it is an infinite sequence since the
outcome specifies what happens when we flip the coin forever. Thus, the set $\Omega$ of
outcomes is the set $\{0, 1\}^\infty$ of infinite sequences of 0's and 1's. If $\omega$ is one such
sequence, we have $\omega = (\omega_0, \omega_1, \ldots)$ where $\omega_n \in \{0, 1\}$. It is then natural to define
$X_n(\omega) = \omega_n$, which simply says that $X_n$ is the outcome of flip $n$, for $n \ge 0$. Hence
$X_n(\omega) \in \Re$ for all $\omega \in \Omega$ and we see that each $X_n$ is a real-valued function defined
on $\Omega$. For instance, $X_0(1101001\ldots) = 1$ since $\omega_0 = 1$ when $\omega = 1101001\ldots$.
Similarly, $X_1(1101001\ldots) = 1$ and $X_2(1101001\ldots) = 0$. To specify the random
experiment, it remains to define the probability on $\Omega$. The simplest way is to say
that

$$P(\{\omega \mid \omega_0 = a, \omega_1 = b, \ldots, \omega_n = z\}) = P(X_0 = a, \ldots, X_n = z) = 1/2^{n+1}$$

for all $n \ge 0$ and $a, b, \ldots, z \in \{0, 1\}$. For instance,

$$P(\{\omega \mid \omega_0 = 1\}) = P(X_0 = 1) = 1/2.$$

Similarly,

$$P(\{\omega \mid \omega_0 = 1, \omega_1 = 0\}) = P(X_0 = 1, X_1 = 0) = 1/4.$$


Observe that we define the probability of a set of outcomes, or event, $\{\omega \mid \omega_0 =
a, \omega_1 = b, \ldots, \omega_n = z\}$ instead of specifying the probability of each outcome $\omega$.
The reason is that the probability that we observe a specific infinite sequence of 0's
and 1's is zero. That is, $P(\{\omega\}) = 0$ for all $\omega \in \Omega$. Such a description does not tell
us much about the coin flips! For instance, it does not specify the bias of the coin, or
the fact that successive flips are independent. Hence, the correct way to proceed is to
specify the probability of events, which are sets of outcomes, instead of the probability
of individual outcomes.

For a Markov chain, there is some sample space $\Omega$ and each $X_n$ is a function
$X_n(\omega)$ of the outcome $\omega$ that takes values in $\mathcal{X}$. A probability is defined on subsets
of $\Omega$.

In this example, one can choose $\Omega$ to be the set of possible infinite sequences
of symbols in $\mathcal{X}$. That is, $\Omega = \mathcal{X}^\infty$ and an element $\omega \in \Omega$ is $\omega = (\omega_0, \omega_1, \ldots)$
with $\omega_n \in \mathcal{X}$ for $n \ge 0$. With this choice, one has $X_n(\omega) = \omega_n$ for $n \ge 0$ and
$\omega \in \Omega$, as shown in Fig. 2.1. This choice of $\Omega$, similar to what we did for the coin
flips, is called the canonical sample space. Thus, an outcome is the actual sequence
of values of the Markov chain, called the trajectory, or realization of the Markov
chain. It remains to specify the probability of events in $\Omega$. The trick here is that the
probability that the Markov chain follows a specific infinite sequence is 0, similarly
to the probability that coin flips follow a specific infinite sequence such as all heads.
Thus, one should specify the probability of subsets of $\Omega$, not of individual outcomes.
One specifies that

$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \pi_0(i_0) P(i_0, i_1) \times \cdots \times P(i_{n-1}, i_n), \qquad (2.1)$$

for all $n \ge 0$ and $i_0, i_1, \ldots, i_n$ in $\mathcal{X}$. Here, $\pi_0(i_0)$ is the probability that the Markov
chain starts in state $i_0$.

Fig. 2.1 In the canonical sample space, the outcome $\omega$ is the trajectory of the Markov chain

This identity is equivalent to (1.3). Indeed, if we let

$$A_n = \{X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n\}$$

and

$$A_{n-1} = \{X_0 = i_0, X_1 = i_1, \ldots, X_{n-1} = i_{n-1}\},$$

then

$$P(A_n) = P[A_n \mid A_{n-1}] P(A_{n-1}) = P(A_{n-1}) P(i_{n-1}, i_n),$$

by (1.3), so that (2.1) holds by induction on $n$.

Thus, one has defined the probability of events characterized by the first $n + 1$
values of the Markov chain. It turns out that there is one probability on $\Omega$ that is
consistent with these values.
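
As a small illustration of (2.1), the following sketch multiplies the initial probability by the successive transition probabilities along a finite trajectory. The two-state matrix `P`, the initial distribution `pi0`, and the function name `trajectory_probability` are assumptions made for this example; they are not taken from the book's figures.

```python
from itertools import product

def trajectory_probability(pi0, P, states):
    """Probability that (X_0, ..., X_n) equals `states`, computed as in (2.1)."""
    prob = pi0[states[0]]
    for i, j in zip(states, states[1:]):
        prob *= P[i][j]
    return prob

# Hypothetical two-state chain, chosen only for illustration.
P = [[0.9, 0.1],
     [0.4, 0.6]]
pi0 = [1.0, 0.0]  # the chain starts in state 0

# P(X_0 = 0, X_1 = 0, X_2 = 1, X_3 = 1) = 1.0 * 0.9 * 0.1 * 0.6
p_path = trajectory_probability(pi0, P, [0, 0, 1, 1])
print(p_path)

# Consistency check: the probabilities of all trajectories of a given
# length add up to one.
total = sum(trajectory_probability(pi0, P, list(path))
            for path in product(range(2), repeat=4))
print(total)
```

The second computation reflects the fact that the events defined by the first $n + 1$ values partition the sample space.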


2.2 Laws of Large Numbers for Coin Flips 


Before we discuss the case of Markov chains, let us consider the simpler example of
coin flips. Let then $\{X_n, n \ge 0\}$ be i.i.d. Bernoulli random variables with $P(X_n =
0) = P(X_n = 1) = 0.5$, as in the previous section. We think of $X_n = 1$ if flip $n$
yields heads and $X_n = 0$ if it yields tails. We want to show that, as we keep flipping
the coin, the fraction of heads approaches 50%. There are two statements that make
this idea precise.


2.2.1 Convergence in Probability


The first statement, called the Weak Law of Large Numbers (WLLN), says that it is
very unlikely that the fraction of heads in $n$ coin flips differs from 50% by even a
small amount, say 1%, if $n$ is large. For instance, let $n = 10^5$. We want to show that
the likelihood that the fraction of heads among $10^5$ flips is more than 51% or less
than 49% is small. Moreover, this likelihood can be made as small as we wish if we
flip the coin more times.

To show this, let

$$Y_n = \frac{X_0 + \cdots + X_{n-1}}{n}$$

be the fraction of heads in the first $n$ flips. We claim that

$$P(|Y_n - E(Y_n)| \ge \epsilon) \le \frac{\mathrm{var}(Y_n)}{\epsilon^2}. \qquad (2.2)$$

This result is called Chebyshev's inequality (Fig. 2.2).
To see (2.2), observe that¹

$$1\{|Y_n - E(Y_n)| \ge \epsilon\} \le \frac{(Y_n - E(Y_n))^2}{\epsilon^2}. \qquad (2.3)$$

Indeed, if $|Y_n - E(Y_n)| \ge \epsilon$, then $(Y_n - E(Y_n))^2 \ge \epsilon^2$, so that if the left-hand side
of inequality (2.3) is one, the right-hand side is at least equal to one. Also, if the
left-hand side is zero, it is less than or equal to the right-hand side. Thus, (2.3) holds
and (2.2) follows by taking the expected values in (2.3), since $E(1_A) = P(A)$ and
$E((Y_n - E(Y_n))^2) = \mathrm{var}(Y_n)$ and since expectation is monotone (B.2).


Fig. 2.2 Pafnuty Chebyshev. 1821-1894


¹By definition, $1\{C\}$ takes the value 1 if the condition $C$ holds and the value 0 otherwise.
Now, $E(Y_n) = 0.5$ and

$$\mathrm{var}(Y_n) = \frac{\mathrm{var}(X_0 + \cdots + X_{n-1})}{n^2} = \frac{n \, \mathrm{var}(X_0)}{n^2} = \frac{\mathrm{var}(X_0)}{n}.$$

To see this, recall that if one multiplies a random variable by $a$, its variance is
multiplied by $a^2$ (see (B.3)). Also, the variance of a sum of independent random
variables is the sum of their variances (see Theorem B.4). Hence,

$$P(|Y_n - 0.5| \ge \epsilon) \le \frac{\mathrm{var}(X_0)}{n \epsilon^2}.$$

Since $X_0 =_D B(0.5)$, we find that

$$\mathrm{var}(X_0) = E(X_0^2) - (E(X_0))^2 = E(X_0) - (E(X_0))^2 = 0.5 - 0.25 = 0.25.$$

Thus,

$$P(|Y_n - 0.5| \ge \epsilon) \le \frac{1}{4 n \epsilon^2}.$$

In particular, if we choose $\epsilon = 1\% = 0.01$, we find

$$P(|Y_n - 0.5| \ge 1\%) \le \frac{2{,}500}{n} = 0.025 \text{ with } n = 10^5.$$


More generally, we have shown that

$$P(|Y_n - 0.5| \ge \epsilon) \to 0 \text{ as } n \to \infty, \quad \forall \epsilon > 0.$$


This is the WLLN. 
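
To get a feeling for how conservative Chebyshev's bound is, one can compare it with the exact probability for fair coin flips, which is a sum of binomial terms. The sketch below does this for a few values of $n$ with $\epsilon = 0.05$; the function names `exact_tail` and `chebyshev_bound` and the chosen values of $n$ and $\epsilon$ are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb

def exact_tail(n, eps):
    """Exact P(|Y_n - 1/2| >= eps) when Y_n is the fraction of heads in n fair flips."""
    half = Fraction(1, 2)
    favorable = sum(comb(n, k) for k in range(n + 1)
                    if abs(Fraction(k, n) - half) >= eps)
    return float(Fraction(favorable, 2 ** n))

def chebyshev_bound(n, eps):
    """Upper bound 1 / (4 n eps^2) derived in the text for fair coin flips."""
    return float(1 / (4 * n * eps ** 2))

eps = Fraction(1, 20)  # exact rational arithmetic avoids rounding at the boundary
results = {}
for n in (100, 400, 1600):
    results[n] = (exact_tail(n, eps), chebyshev_bound(n, eps))
    print(n, results[n])
```

Both quantities go to zero as $n$ increases, and the exact probability is always below the bound, as the inequality requires.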


2.2.2 Almost Sure Convergence


The second statement is the Strong Law of Large Numbers (SLLN). It says that,
for all the sequences of coin flips we will ever observe, the fraction $Y_n$ actually
converges to 50% as we keep on flipping the coin.

There are many sequences of coin flips for which the fraction of heads does not 
approach 50%. For instance, the sequence that yields heads for every flip is such 
that Y, = 1 for all n and thus Y, does not converge to 50%. Similarly, the sequence 
001001001001001 . .. is such that Y,, approaches 1/3 and not 50%. What the SLLN 
implies is that all those sequences such that Y, does not converge to 50% have 
probability 0: they will never be observed. 
Thus, this statement is very deep because there are so many sequences to rule
out. Keeping track of all of them seems rather formidable. Indeed, the proof of this
statement is quite clever. Here is how it proceeds. Note that

$$P(|Y_n - 0.5| \ge \epsilon) \le \frac{E((Y_n - 0.5)^4)}{\epsilon^4}, \quad \forall n, \epsilon > 0.$$

Indeed,

$$1\{|Y_n - 0.5| \ge \epsilon\} \le \frac{(Y_n - 0.5)^4}{\epsilon^4}$$

and the previous inequality follows by taking expectations. Now,

$$(Y_n - 0.5)^4 = \frac{((X_0 - 0.5) + \cdots + (X_{n-1} - 0.5))^4}{n^4}.$$

Also, with $Z_m = X_m - 0.5$, one has

$$E\left(((X_0 - 0.5) + \cdots + (X_{n-1} - 0.5))^4\right) = E\left(\left(\sum_{m=0}^{n-1} Z_m\right)^4\right) = E\left(\sum_{a,b,c,d} Z_a Z_b Z_c Z_d\right),$$


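where the sum is over all $a, b, c, d \in \{0, 1, \ldots, n - 1\}$. This sum consists of $n$
terms $Z_a^4$, $3n(n - 1)$ terms $Z_a^2 Z_b^2$ with $a \ne b$ (there are three ways to split the four
indices into two pairs), and other terms where at least a
factor $Z_a$ is not repeated. The latter terms have zero-mean since $E(Z_a Z_b Z_c Z_d) =
E(Z_a) E(Z_b Z_c Z_d) = 0$, by independence, whenever $b$, $c$, and $d$ are all different
from $a$. Consequently,

$$E\left(\sum_{a,b,c,d} Z_a Z_b Z_c Z_d\right) = n E(Z_0^4) + 3n(n - 1) E(Z_0^2 Z_1^2) = n\alpha + 3n(n - 1)\beta$$

with $\alpha = E(Z_0^4)$ and $\beta = E(Z_0^2 Z_1^2)$. Hence, substituting the result of this
calculation in the previous expressions, we find that

$$P(|Y_n - 0.5| \ge \epsilon) \le \frac{n\alpha + 3n(n - 1)\beta}{n^4 \epsilon^4} \le \frac{3n^2(\alpha + \beta)}{n^4 \epsilon^4} = \frac{3(\alpha + \beta)}{n^2 \epsilon^4}.$$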
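
The fourth moment of the centered sum can be checked by brute force for small $n$. The sketch below enumerates all $2^n$ equally likely outcomes of $n$ fair coin flips, computes $E((Z_0 + \cdots + Z_{n-1})^4)$ exactly, and compares it with the closed form $n\alpha + 3n(n-1)\beta$ obtained by counting index tuples, where $\alpha = \beta = 1/16$ for fair coin flips. The function name and the range of $n$ are assumptions made for this check.

```python
from fractions import Fraction
from itertools import product

def fourth_moment_by_enumeration(n):
    """E((Z_0 + ... + Z_{n-1})^4) with Z_m = X_m - 1/2 and X_m fair coin flips."""
    half = Fraction(1, 2)
    total = Fraction(0)
    for flips in product((0, 1), repeat=n):
        centered_sum = sum(x - half for x in flips)
        total += centered_sum ** 4
    return total / 2 ** n  # all 2^n outcomes are equally likely

alpha = Fraction(1, 16)  # E(Z_0^4) = (1/2)^4
beta = Fraction(1, 16)   # E(Z_0^2 Z_1^2) = (1/4) * (1/4)

checks = {}
for n in range(1, 9):
    closed_form = n * alpha + 3 * n * (n - 1) * beta
    checks[n] = (fourth_moment_by_enumeration(n), closed_form)
    print(n, checks[n])
```

The factor 3 counts the ways of splitting the four indices $a, b, c, d$ into two pairs.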
This inequality implies that²

$$\sum_{n \ge 1} P(|Y_n - 0.5| \ge \epsilon) < \infty.$$

This expression shows that the events $A_n := \{|Y_n - 0.5| \ge \epsilon\}$ have probabilities that
add up to a finite number. From the Borel-Cantelli Theorem B.1, we conclude that

$$P(A_n, \text{ i.o.}) = 0.$$

This result says that, with probability one, $\omega$ belongs only to finitely many $A_n$'s.
Hence,³ with probability one, there is some $n(\omega)$ so that $\omega \notin A_n$ for $n \ge n(\omega)$. That
is,

$$|Y_n(\omega) - 0.5| < \epsilon, \quad \forall n \ge n(\omega).$$

Since this property holds for an arbitrary $\epsilon > 0$, we conclude that, with probability
one,

$$Y_n(\omega) \to 0.5 \text{ as } n \to \infty.$$

Indeed, if $Y_n(\omega)$ does not converge to 50%, there must be some $\epsilon > 0$ so that
$|Y_n - 0.5| \ge \epsilon$ for infinitely many $n$'s and we have seen that this is not the case.


2.3 Laws of Large Numbers for i.i.d. RVs


The results that we proved for coin flips extend to i.i.d. random variables $\{X_n, n \ge
0\}$ to show that

$$Y_n := \frac{X_0 + \cdots + X_{n-1}}{n}$$

approaches $E(X_0)$ as $n \to \infty$. As for coin flips, there are two ways of making that
statement precise.


²Recall that $\sum_{n \ge 1} 1/n^2 < \infty$.

³Let $n(\omega) - 1$ be the largest $n$ such that $\omega \in A_n$.


2.3.1 Weak Law of Large Numbers
We need a definition. 


Definition 2.1 (Convergence in Probability) Let $X_n, n \ge 0$ and $X$ be random
variables defined on a common probability space. One says that $X_n$ converges in
probability to $X$, and one writes $X_n \xrightarrow{p} X$ if, for all $\epsilon > 0$,

$$P(|X_n - X| > \epsilon) \to 0 \text{ as } n \to \infty.$$


The Weak Law of Large Numbers (WLLN) is the following result. 


Theorem 2.1 (Weak Law of Large Numbers) Let $\{X_n, n \ge 0\}$ be a sequence of
i.i.d. random variables with mean $\mu$. Then

$$Y_n = \frac{X_0 + \cdots + X_{n-1}}{n} \xrightarrow{p} \mu. \qquad (2.4)$$


Proof Assume that $E(X_n^2) < \infty$. The proof is then the same as for coin flips and is
left as an exercise. For the general case, see Theorem 15.14. □


The first result of this type was proved by Jacob Bernoulli (Fig. 2.3). 


2.3.2 Strong Law of Large Numbers 
We again need a definition. 


Definition 2.2 (Almost Sure Convergence) Let $X_n, n \ge 0$ and $X$ be random
variables defined on a common probability space. One says that $X_n$ converges
almost surely to $X$ as $n \to \infty$, and one writes $X_n \to X$, a.s. if

$$P\left(\lim_{n \to \infty} X_n(\omega) = X(\omega)\right) = 1.$$

Thus, this convergence means that the sequence of real numbers $X_n(\omega)$ converges
to the real number $X(\omega)$ as $n \to \infty$, with probability one.


Fig. 2.3 Jacob Bernoulli. 1655-1705

Fig. 2.4 When rolling a balanced die, the sample mean converges to 3.5

Let $\{X_n, n \ge 0\}$ be as in the statement of Theorem 2.1. We have the following
result.⁴


Theorem 2.2 (Strong Law of Large Numbers) Let $\{X_n, n \ge 0\}$ be a sequence of
i.i.d. random variables with mean $\mu$. Then

$$\frac{X_0 + \cdots + X_{n-1}}{n} \to \mu \text{ as } n \to \infty, \text{ with probability 1.}$$


Thus, the sample mean values $Y_n := (X_0 + \cdots + X_{n-1})/n$ converge to the
expected value, with probability 1. (See Fig. 2.4.)


Proof Assume that

$$E(X_n^4) < \infty.$$

The proof is then the same as for coin flips and is left as an exercise. The proof of
the SLLN in the general case is given in Theorem 15.14. □


Figure 2.5 illustrates the SLLN and WLLN. The SLLN states that the sample
means of i.i.d. random variables converge to the mean, with probability one. The
WLLN says that as the number of samples increases, the fraction of realizations
where the sample mean differs from the mean by some amount gets small.


⁴Almost sure convergence implies convergence in probability, so SLLN is stronger than WLLN.
See Problem 2.5.


Fig. 2.5 SLLN and WLLN for i.i.d. U[0, 1] random variables. The plot shows 10 different
simulations over 500 samples. SLLN: every observed realization goes to 0.5. WLLN: the
fraction of realizations away from 0.5 shrinks
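
The behavior illustrated in Fig. 2.5 can be reproduced numerically. The following sketch generates a number of independent realizations of the running sample mean of i.i.d. U[0, 1] random variables and reports, for a few values of $n$, the fraction of realizations whose sample mean is farther than $\epsilon$ from 0.5. The seed, the number of realizations, the threshold $\epsilon = 0.05$, and the function name `running_means` are all assumptions made for this illustration.

```python
import random

def running_means(n, rng):
    """Sample means Y_1, ..., Y_n of n i.i.d. U[0, 1] random variables."""
    total, means = 0.0, []
    for k in range(1, n + 1):
        total += rng.random()  # one U[0, 1] sample
        means.append(total / k)
    return means

rng = random.Random(2020)  # fixed seed so that the run is reproducible
realizations = [running_means(500, rng) for _ in range(200)]

eps = 0.05
fraction_far = {}
for n in (10, 50, 100, 500):
    far = sum(abs(r[n - 1] - 0.5) > eps for r in realizations)
    fraction_far[n] = far / len(realizations)
    print(n, fraction_far[n])
```

The shrinking fraction is what the WLLN describes; the SLLN is the stronger statement that each individual realization eventually stays close to 0.5.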


2.4 Law of Large Numbers for Markov Chains 


The long-term fraction of time that a finite irreducible Markov chain spends in a
given state is the invariant probability of that state. For instance, a Markov chain
$X(n)$ on $\{0, 1\}$ with $P(0, 1) = a = P(1, 0)$ with $a \in (0, 1]$ spends half of the time
in state 0, in the long term. The Markov chain in Fig. 1.2 spends a fraction 12/39 of
the time in state A, in the long term.

To understand this property, one should look at the returns to state $i$, as shown
in Fig. 2.6. The figure shows a particular sequence of values of $X(n)$ and it
decomposes this sequence into cycles between successive returns to a given state
$i$. A new cycle starts when the Markov chain comes back to $i$. The durations of
these successive cycles, $T_1, T_2, T_3, \ldots$, are independent and identically distributed,
because the Markov chain starts afresh from state $i$ at each return time, independently
of the previous states. This is a consequence of the Markov property for any given
value $k$ of the return time and of the fact that the distribution of the evolution starting
from state $i$ at time $k$ does not depend on $k$.

It is easy to see that these random times have a finite mean. Indeed, fix one state
$i$. Then, starting from any given state $j$, there is some minimum number $M_j$ of steps
required to go to state $i$. Also, there is some probability $p_j$ that the Markov chain
will go from $j$ to $i$ in $M_j$ steps. Let then $M = \max_j M_j$ and $p = \min_j p_j$. We can
then argue that, starting from any state at time 0, there is at least a probability $p$ that
the Markov chain visits state $i$ after at most $M$ steps. If it does not, we repeat the
argument starting at time $M$. We conclude that $T_1 \le M\tau$ where $\tau$ is a geometric
random variable with parameter $p$. Hence $E(T_1) \le M E(\tau) = M/p < \infty$, as claimed. Note
also that $E(T_1^4) \le M^4 E(\tau^4) < \infty$.


Fig. 2.6 The cycles between returns to state $i$ are i.i.d. The law of large numbers explains the
convergence of the long-term fraction of time to a constant


The Strong Law of Large Numbers states that

$$\frac{T_1 + T_2 + \cdots + T_k}{k} \to E(T_1), \text{ as } k \to \infty, \text{ with probability 1.} \qquad (2.5)$$

Thus, the long-term fraction of time that the Markov chain spends in state $i$ is
given by

$$\lim_{k \to \infty} \frac{k}{T_1 + T_2 + \cdots + T_k} = \frac{1}{E(T_1)} \text{ with probability 1.} \qquad (2.6)$$


Let us clarify why (2.6) implies that the fraction of time in state $i$ converges to
$1/E(T_1)$. Let $A(n)$ be the number of visits to state $i$ by time $n$. We want to show
that $A(n)/n$ converges to $1/E(T_1)$. Note that

$$\frac{k}{T_1 + \cdots + T_{k+1}} \le \frac{A(n)}{n} = \frac{k}{n} \le \frac{k}{T_1 + \cdots + T_k}$$

whenever $T_1 + \cdots + T_k \le n < T_1 + \cdots + T_{k+1}$. If we believe that $T_{k+1}/k \to 0$ as
$k \to \infty$, the inequality above shows that

$$\frac{A(n)}{n} \to \frac{1}{E(T_1)},$$

as claimed. To see why $T_{k+1}/k$ goes to zero, note that

$$P\left(\frac{T_{k+1}}{k} > \epsilon\right) \le P\left(\frac{M\tau}{k} > \epsilon\right) = P\left(\tau > \frac{k\epsilon}{M}\right) \le (1 - p)^{k\epsilon/M - 1}.$$

These probabilities add up to a finite number. Thus, by Borel-Cantelli Theorem B.1,
the event $T_{k+1}/k > \epsilon$ occurs only for
finitely many values of $k$, which proves the convergence to zero.
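
The cycle argument can be observed on the two-state example mentioned at the start of this section. The sketch below simulates the chain on $\{0, 1\}$ with $P(0, 1) = P(1, 0) = a$, records the durations of the cycles between successive returns to state 0, and compares the fraction of time in state 0 with the reciprocal of the average cycle duration. The value $a = 0.3$, the seed, the number of steps, and the function name `simulate_returns` are assumptions made for this illustration.

```python
import random

def simulate_returns(a, steps, rng):
    """Simulate the chain on {0, 1} with P(0,1) = P(1,0) = a, started in state 0.

    Returns the fraction of the times 1..steps spent in state 0 and the list of
    durations of the cycles between successive returns to state 0.
    """
    state = 0
    visits = 0
    last_visit = 0  # the chain is in state 0 at time 0
    cycles = []
    for n in range(1, steps + 1):
        if rng.random() < a:
            state = 1 - state  # jump to the other state
        if state == 0:
            visits += 1
            cycles.append(n - last_visit)
            last_visit = n
    return visits / steps, cycles

rng = random.Random(1)  # fixed seed so that the run is reproducible
fraction_in_0, cycles = simulate_returns(0.3, 200_000, rng)
mean_cycle = sum(cycles) / len(cycles)
print(fraction_in_0, mean_cycle, 1 / mean_cycle)
```

For this symmetric chain, the fraction of time in state 0 should be close to 1/2 and the average cycle duration close to 2.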


2.5 Proof of Big Theorem 


This section presents the proof of the main result about Markov chains. 


2.5.1 Proof of Theorem 1.1 (a) 
Let $m_j$ be the expected return time to state $j$. That is,

$$m_j = E[T_j \mid X(0) = j] \text{ with } T_j = \min\{n > 0 \mid X(n) = j\}.$$

We show that $\pi(j) = 1/m_j$, $j = 1, \ldots, N$ is the unique invariant distribution if the
Markov chain is irreducible.
During $n = 1, \ldots, N$ where $N \gg 1$, the Markov chain visits state $j$ a fraction
$1/m_j$ of the times. A fraction $P(j, i)$ of those times, it visits state $i$ just after visiting
state $j$. Thus, a fraction $(1/m_j) P(j, i)$ of the times, the Markov chain visits $j$ then
$i$ in successive steps. By summing over $j$, we find the fraction of the times that the
Markov chain visits $i$. Thus,

$$\sum_j \frac{1}{m_j} P(j, i) = \frac{1}{m_i}.$$

Hence, there is one invariant distribution $\pi$ and it is given by $\pi(i) = 1/m_i$, which is
the fraction of time that the Markov chain spends in state $i$.

To show that the invariant distribution is unique, assume that there is another one,
say $\phi(i)$. Start the Markov chain with that distribution. Then

$$\frac{1}{N} \sum_{n=0}^{N-1} 1\{X(n) = i\} \to \pi(i).$$

However, taking expectation, we find that the left-hand side is equal to $\phi(i)$. Thus,
$\phi = \pi$ and the invariant distribution is unique.⁵


⁵Indeed, $E(1\{X(n) = i\}) = P(X(n) = i) = \phi(i)$.


Fig. 2.7 An aperiodic Markov chain


2.5.2 Proof of Theorem 1.1 (b) 


If the Markov chain is irreducible but not aperiodic, then $\pi_n$ may not converge to
the invariant distribution $\pi$. For instance, if the Markov chain alternates between 0
and 1 and starts from 0, then $\pi_n = [1, 0]$ for $n$ even and $\pi_n = [0, 1]$ for $n$ odd, so
that $\pi_n$ does not converge to $\pi = [0.5, 0.5]$.

If the Markov chain is aperiodic, $\pi_n \to \pi$. Moreover, the convergence is
geometric. We first illustrate the argument on a simple example shown in Fig. 2.7.
Consider the number of steps to go from 1 to 1. Note that

$$\{n > 0 \mid P^n(1, 1) > 0\} = \{3, 4, 6, 7, 8, 9, 10, \ldots\}.$$

Thus, $P^n(1, 1) > 0$ if $n \ge 6$. Now, $P[X(2) = 1 \mid X(0) = 2] > 0$, so that
$P[X(n) = 1 \mid X(0) = 2] > 0$ for $n \ge 8$. Indeed, if $n \ge 8$, then $X$ can go from 2 to
1 in two steps and then from 1 to 1 in $n - 2$ steps. The argument is similar for the
other states and we find that there is some $M > 0$ and some $p > 0$ such that

$$P[X(M) = 1 \mid X(0) = i] \ge p, \quad i = 1, 2, 3, 4.$$
Now, consider two copies of the Markov chain: $\{X(n), n \ge 0\}$ and $\{Y(n), n \ge 0\}$.
One chooses $X(0)$ with distribution $\pi_0$ and $Y(0)$ with the invariant distribution $\pi$.
The two Markov chains evolve independently initially. We define

$$\tau = \min\{n \ge 0 \mid X(n) = Y(n)\}.$$

In view of the observation above,

$$P(X(M) = 1 \text{ and } Y(M) = 1) \ge p^2.$$

Thus, $P(\tau > M) \le 1 - p^2$. If $\tau > M$, then the two Markov chains have not met yet
by time $M$. Using the same argument as before, we see that they have a probability
at least $p^2$ of meeting in the next $M$ steps. Thus,

$$P(\tau > kM) \le (1 - p^2)^k.$$

Now, modify $X(n)$ by gluing it to $Y(n)$ after time $\tau$. This coupling operation does
not change the fact that $X(n)$ still evolves according to the transition matrix $P$, so
that $P(X(n) = i) = \pi_n(i)$ where $\pi_n = \pi_0 P^n$.




Now,

$$\sum_i |P(X(n) = i) - P(Y(n) = i)| \le 2 P(X(n) \ne Y(n)) \le 2 P(\tau > n).$$

Hence,

$$\sum_i |\pi_n(i) - \pi(i)| \le 2 P(\tau > n),$$

and this implies that

$$\sum_i |\pi_n(i) - \pi(i)| \le 2 (1 - p^2)^k \text{ if } n > kM.$$


To extend this argument to a general aperiodic Markov chain, we need the fact
that for each state $i$ there is some integer $n_i$ such that $P^n(i, i) > 0$ for all $n \ge n_i$.
We prove that fact as Lemma 2.3 in the following section.
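
The contrast between the aperiodic and periodic cases can be checked numerically by iterating $\pi_{n+1} = \pi_n P$ and tracking the distance $\sum_i |\pi_n(i) - \pi(i)|$. The sketch below does this for a hypothetical two-state aperiodic chain (it is not the chain of Fig. 2.7) and for the chain that alternates between two states. The matrices and function names are assumptions made for this illustration.

```python
def step(dist, P):
    """One step of pi_{n+1} = pi_n P, with `dist` a row vector."""
    size = len(P)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]

def distance_to(dist, target):
    """Sum over i of |dist(i) - target(i)|."""
    return sum(abs(x - y) for x, y in zip(dist, target))

# Hypothetical aperiodic chain; its invariant distribution is [4/7, 3/7].
P_aperiodic = [[0.7, 0.3],
               [0.4, 0.6]]
pi = [4 / 7, 3 / 7]

dist = [1.0, 0.0]
distances = []
for n in range(20):
    distances.append(distance_to(dist, pi))
    dist = step(dist, P_aperiodic)
print(distances[:5])  # each value is 0.3 times the previous one

# Periodic chain that alternates between its two states.
P_periodic = [[0.0, 1.0],
              [1.0, 0.0]]
dist = [1.0, 0.0]
periodic_distances = []
for n in range(6):
    periodic_distances.append(distance_to(dist, [0.5, 0.5]))
    dist = step(dist, P_periodic)
print(periodic_distances)  # the distance never decreases
```

In the aperiodic example the distance decreases by the factor 0.3 at every step, which is the second eigenvalue of that matrix, so the convergence is geometric.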


2.5.3 Periodicity 
We start with a property of the set of return times of an irreducible Markov chain. 


Lemma 2.1 Fix a state $i$ and let $S := \{n > 0 \mid P^n(i, i) > 0\}$ and $d = \text{g.c.d.}(S)$.
There must be two integers $n$ and $n + d$ in the set $S$.


Proof The trick is clever. We first illustrate it on an example. Assume $S =
\{9, 15, 21, \ldots\}$ with $d = \text{g.c.d.}(S) = 3$. There must be $a, b \in S$ with $\text{g.c.d.}\{a, b\} =
3$. Otherwise, the gcd of $S$ would not be 3. Here, we can choose $a = 15$ and $b = 21$.
Now, consider the following operations:

$$(a, b) = (15, 21) \to (6, 15) \to (6, 9) \to (3, 6) \to (3, 3).$$

At each step, we go from $(x, y)$ with $x < y$ to the ordered pair of $\{x, y - x\}$. Note
that at each step, each term in the pair $(x, y)$ is an integer linear combination of $a$
and $b$. For instance, $(6, 15) = (b - a, a)$. Then, $(6, 9) = (b - a, a - (b - a)) =
(b - a, 2a - b)$, and so on. Eventually, we must get to $(3, 3)$. Indeed, the terms are
always decreasing until we get to zero. Assume we get to $(x, x)$ with $x \ne 3$. At the
previous step, we had $(x, 2x)$. The step before must have been $(x, 3x)$, and so on.
Going back all the way to $(a, b)$, we see that $a$ and $b$ are both multiples of $x$. But
then, $\text{g.c.d.}\{a, b\} = x$, a contradiction.

From this construction, since at each step the terms are integer linear combinations
of $a$ and $b$, we see that

$$3 = ma + nb$$
for some integers $m$ and $n$. Thus,

$$3 = m^+ a + n^+ b - m^- a - n^- b,$$

where $m^+ = \max\{m, 0\}$ and $m = m^+ - m^-$, and similarly for $n^+$ and $n^-$. Now we
can choose

$$N = m^- a + n^- b \text{ and } N + 3 = m^+ a + n^+ b.$$

The last step of the argument is to notice that if $a, b \in S$, then $\alpha a + \beta b \in S$ for any
nonnegative integers $\alpha$ and $\beta$ that are not both zero. This fact follows from the definition
of $S$ as the return times from $i$ to $i$. Hence, both $N$ and $N + 3$ are in $S$.

The proof for a general set $S$ with gcd equal to $d$ is identical. □
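
The construction in this proof is easy to mechanize. The sketch below repeats the subtraction steps while tracking each term as an integer combination $ma + nb$, then splits the final coefficients into positive and negative parts to produce $N$ and $N + d$. The function name `consecutive_return_times` is an assumption made for this illustration, and the code assumes $a \ne b$ are positive integers.

```python
def consecutive_return_times(a, b):
    """Return (N, N + d), with d = gcd(a, b), both written as nonnegative
    integer combinations of a and b, following the subtraction steps of the
    proof. Assumes a and b are distinct positive integers."""
    # Each entry is (value, m, n) with value = m * a + n * b.
    x = (a, 1, 0)
    y = (b, 0, 1)
    while x[0] != y[0]:
        if x[0] > y[0]:
            x, y = y, x  # keep the pair ordered
        # Replace (x, y) by the pair {x, y - x}.
        y = (y[0] - x[0], y[1] - x[1], y[2] - x[2])
    d, m, n = x  # d = m * a + n * b
    m_plus, m_minus = max(m, 0), max(-m, 0)
    n_plus, n_minus = max(n, 0), max(-n, 0)
    N = m_minus * a + n_minus * b
    return N, m_plus * a + n_plus * b

result = consecutive_return_times(15, 21)
print(result)  # 3 = 3 * 15 - 2 * 21, so N = 2 * 21 and N + 3 = 3 * 15
```

With $a = 15$ and $b = 21$, the steps are the ones listed in the proof and the function returns the pair $(42, 45)$.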


This result enables us to show that the period of a Markov chain is well-defined. 


Lemma 2.2 For an irreducible Markov chain, $d(i)$ defined in (1.6) has the same
value for all states.


Proof Pick $j \ne i$. We show that $d(j) \le d(i)$. This suffices to prove the lemma,
since by symmetry one also has $d(i) \le d(j)$.

By irreducibility, $P^m(j, i) > 0$ for some $m$ and $P^n(i, j) > 0$ for some $n$. Now,
by definition of $d(i)$ and by the previous lemma, there is some integer $N$ such that
$P^N(i, i) > 0$ and $P^{N + d(i)}(i, i) > 0$. But then,

$$P^{m + N + n}(j, j) > 0 \text{ and } P^{m + N + d(i) + n}(j, j) > 0.$$

This implies that the integers $K := n + N + m$ and $K + d(i)$ are both in $S := \{n >
0 \mid P^n(j, j) > 0\}$. Clearly, this shows that

$$d(j) := \text{g.c.d.}(S) \le d(i). \quad \square$$


The following fact then suffices for our proof of convergence, as we explained in 
the example. 


Lemma 2.3 Let $X$ be an irreducible aperiodic Markov chain. Let $S = \{n >
0 \mid P^n(i, i) > 0\}$. Then, there is some $n_i$ such that $n \in S$, for all $n \ge n_i$.


Proof We know from Lemma 2.1 that there is some integer $N$ such that $N, N + 1 \in
S$. We claim that

$$n \in S, \quad \forall n > N^2.$$

To see this, first note that for $m > N - 1$ one has

$$mN + 0 = mN,$$
$$mN + 1 = (m - 1)N + (N + 1),$$
$$mN + 2 = (m - 2)N + 2(N + 1),$$
$$\cdots$$
$$mN + N - 1 = (m - N + 1)N + (N - 1)(N + 1).$$

Now, for $n > N^2$ one can write

$$n = mN + k$$

for some $k \in \{0, 1, \ldots, N - 1\}$ and $m > N - 1$. Thus, $n$ is an integer linear
combination, with nonnegative coefficients, of $N$ and $N + 1$ that are both in $S$, so that $n \in S$. □
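
The claim of the proof can be checked by exhaustive search for small values of $N$. The sketch below tests whether an integer $n$ is a combination $\alpha N + \beta (N + 1)$ with nonnegative integers $\alpha$ and $\beta$, verifies the claim for a range of $n > N^2$, and lists the integers that are not representable when $N = 5$. The function name `representable` and the ranges searched are assumptions made for this check.

```python
def representable(n, N):
    """True if n = alpha * N + beta * (N + 1) for some integers alpha, beta >= 0."""
    return any((n - beta * (N + 1)) % N == 0
               for beta in range(n // (N + 1) + 1))

claim_holds = True
for N in range(1, 13):
    # Check every n with N^2 < n <= N^2 + 200.
    for n in range(N * N + 1, N * N + 201):
        if not representable(n, N):
            claim_holds = False
print(claim_holds)

# The bound N^2 is convenient but not tight: for N = 5, list the integers
# up to 25 that cannot be written as a combination of 5 and 6.
gaps = [n for n in range(1, 26) if not representable(n, 5)]
print(gaps)
```

The threshold $N^2$ in the proof is chosen for simplicity of the argument; the list of gaps shows that smaller integers are often already in the set.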


2.6 Summary 


* Sample Space; 

* Laws of Large Numbers: SLLN and WLLN; 

* WLLN from Chebyshev's Inequality; 

* SLLN from Borel-Cantelli and fourth moment bound; 

* SLLN for Markov chains using the i.i.d. return times to a state;

* Proof of Big Theorem.


2.6.1 Key Equations and Formulas 


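SLLN                   $(X_1 + \cdots + X_n)/n \to E(X_1)$, w.p. 1                                              T.2.2
Chebyshev              $P(|(X_1 + \cdots + X_n)/n - \mu| \ge \epsilon) \le \mathrm{var}(X_1)/(n\epsilon^2)$     (2.2)
Convergence in Prob.   $P(|X_n - X| > \epsilon) \to 0$                                                          D.2.1
Borel-Cantelli         $\sum_n P(A_n) < \infty \Rightarrow P(A_n, \text{ i.o.}) = 0$                            T.B.1
SLLN for MC            $(1\{X_1 = i\} + \cdots + 1\{X_n = i\})/n \to \pi(i)$, w.p. 1                            T.1.1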


2.7 References


An excellent text on Markov Chains is Chung (1967). A more advanced text on 
probability theory is Billingsley (2012). 


2.8 Problems 


Problem 2.1 Consider a Markov chain $X_n$ that takes values in $\{0, 1\}$. Explain why
$\{0, 1\}$ is not its sample space.


Problem 2.2 Consider again a Markov chain that takes values in $\{0, 1\}$ with
$P(0, 1) = a$ and $P(1, 0) = b$. Exhibit two different sample spaces and the
probability on them for that Markov chain.


Problem 2.3 Draw the smallest periodic Markov chain. Show that the fraction of 
time in the states converges but the probability of being in a state at time n does not 
converge. 


Problem 2.4 For the Markov chain in Problem 2.2, calculate the eigenvalues and 
use them to get a bound on the distance between the distribution at time n and the 
invariant distribution. 


Problem 2.5 Why does the strong law imply the weak law? More concretely, let
$X_n$, $X$ be random variables such that $X_n \to X$ almost surely. Show that $X_n \to X$
in probability.


Hint Fix $\epsilon > 0$ and define $Z_n = 1\{|X_n - X| > \epsilon\}$. Use DCT to show that $E(Z_n) \to
0$ as $n \to \infty$ if $X_n \to X$ almost surely.


Problem 2.6 Draw a Markov chain with four states that is irreducible and aperiodic.
Consider two independent versions of the Markov chain: one that starts in
state 1, the other in state 2. Explain why they will meet after a finite time.


Problem 2.7 Consider the Markov chain of Fig. 1.2. Use Python to calculate the
eigenvalues of $P$. Let $\lambda$ be the largest absolute value of the eigenvalues other than
1. Use Python to calculate

$$d(n) := \sum_i |\pi(i) - \pi_n(i)|,$$

where $\pi_0(A) = 1$. Plot $d(n)$ and $\lambda^n$ as functions of $n$.


Problem 2.8 You flip a fair coin. If the outcome is "head," you get a random
amount of money equal to $X$ and if it is "tail," you get a random amount $Y$. Prove
formally that on average, you get

$$\frac{1}{2} E(X) + \frac{1}{2} E(Y).$$


Problem 2.9 Can you find random variables that converge to 0 almost surely, but 
not in probability? 


Problem 2.10 Let $\{X_n, n \ge 1\}$ be i.i.d. zero-mean random variables with variance
$\sigma^2$. Show that $X_n/n \to 0$ with probability one as $n \to \infty$.


Hint Borel-Cantelli. 


Problem 2.11 Let $X_n$ be a finite irreducible Markov chain on $\mathcal{X}$ with invariant
distribution $\pi$ and $f : \mathcal{X} \to \Re$ some function. Show that

$$\frac{1}{N} \sum_{n=0}^{N-1} f(X_n) \to \sum_{i \in \mathcal{X}} \pi(i) f(i) \text{ w.p. 1, as } N \to \infty.$$


Application: Sharing Links, Multiple Access, Buffers
Topics: Central Limit Theorem, Confidence Intervals, Queueing, Randomized Protocols


Background: 


* General RV (B.4) 


3.1 Sharing Links 


One essential idea in communication networks is to have different users share 
common links. 

For instance, many users are attached to the same coaxial cable; a large number 
of cell phones use the same base station; a WiFi access point serves many devices; 
the high-speed links that connect buildings or cities transport data from many users 
at any given time (Figs. 3.1 and 3.2). 

Networks implement this sharing of physical resources by transmitting bits that 
carry information of different users on common physical media such as cables, 
wires, optical fibers, or radio channels. This general method is called multiplexing. 
Multiplexing greatly reduces the cost of the communication systems. In this chapter, 
we explain statistical aspects of multiplexing. 


Fig. 3.1 Shared coaxial cable for internet access, TV, and telephone

Fig. 3.2 Cellular base station antennas

Fig. 3.3 A random number $\nu$ of connections share a link with rate $C$


In the internet, at any given time, a number of packet flows share links. For 
instance, 20 users may be downloading web pages or video files and use the same 
coaxial cable of their service provider. 

The transmission control protocol (TCP) arranges for these different flows to 
share the links as equally as possible (at least, in principle). 

We focus our attention on a single link, as shown in Fig. 3.3. The link transmits 
bits at rate $C$ bps. If $\nu$ connections are active at a given time, they each get a rate
$C/\nu$. We want to study the typical rate that a connection gets. The nontrivial aspect
of the problem is that $\nu$ is a random variable.

As a simple model, assume that there are $N \gg 1$ users who can potentially
use that link. Assume also that the users are active independently, with probability
$p$. Thus, the number $\nu$ of active users is Binomial($N$, $p$) that we also write as
$B(N, p)$. (See Sect. B.2.8.)

Figure 3.4 shows the probability mass function for $N = 100$ and $p = 0.1$, 0.2,
and 0.5. To be specific, assume that $N = 100$ and $p = 0.2$. The number $\nu$ of active
users is $B(100, 0.2)$ that we also write as Binomial(100, 0.2). On average, there
are $Np = 20$ active users. However, there is some probability that a few more than
20 users are active. We want to find a number $m$ so that the likelihood that there are
more than $m$ active users is negligible, say 5%. Given that value, we know that each
active user gets at least a rate $C/m$, with probability 95%.

Fig. 3.4 The probability mass function of the Binomial(100, $p$) distribution, for $p = 0.1$, 0.2,
and 0.5

Fig. 3.5 The Python tool "ppf" shows (3.1)

Thus, we can dimension the links, or provision the network, based on that value
$m$. Intuitively, $m$ should be slightly larger than the mean. Looking at the actual
distribution, for instance, by using Python's "ppf" as in Fig. 3.5, we find that

$$P(\nu \le 27) = 0.966 > 95\% \text{ and } P(\nu \le 26) = 0.944 < 95\%. \qquad (3.1)$$

Thus, the smallest value of $m$ such that $P(\nu \le m) \ge 95\%$ is $m = 27$.
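
The value $m = 27$ can be reproduced with the standard library alone, by summing the binomial probability mass function until the cumulative probability reaches 95%. The sketch below is one way to do this; the function names `binomial_cdf` and `smallest_m` are assumptions made for this illustration. The "ppf" mentioned in the text is presumably the percent point function of a statistics package such as `scipy.stats.binom.ppf`, which returns this kind of quantile directly.

```python
from math import comb

def binomial_cdf(m, N, p):
    """P(nu <= m) for nu distributed as Binomial(N, p)."""
    return sum(comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(m + 1))

def smallest_m(N, p, level):
    """Smallest m such that P(nu <= m) >= level."""
    m = 0
    while binomial_cdf(m, N, p) < level:
        m += 1
    return m

m = smallest_m(100, 0.2, 0.95)
print(m, binomial_cdf(26, 100, 0.2), binomial_cdf(27, 100, 0.2))
```

The two cumulative probabilities printed are the ones that appear in (3.1).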

To avoid having to use distribution tables or computation tools, we use the fact
that the binomial distribution is well approximated by a Gaussian random variable 
that we discuss next. 


3.2 Gaussian Random Variable and CLT 
Definition 3.1 (Gaussian Random Variable) 

(a) A random variable W is Gaussian, or normal, with mean 0 and variance 1, and 
one writes W =_D N(0, 1), if its probability density function (pdf) is f_W where 

\[ f_W(x) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{x^2}{2}\right\}, \quad x \in \mathbb{R}. \]

One also says that W is a standard normal random variable, or a standard 
Gaussian random variable (named after C.F. Gauss, see Fig. 3.6). 

(b) A random variable X is Gaussian, or normal, with mean μ and variance σ², and 
we write X =_D N(μ, σ²), if 

\[ X = \mu + \sigma W, \]

where W =_D N(0, 1). Equivalently,¹ the pdf of X is given by 

\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}. \]


Figure 3.7 shows the pdf of a N(0, 1) random variable W. Note in particular 
that 


Fig. 3.6 Carl Friedrich 
Gauss. 1777-1855 


¹ See (B.9). 

Fig. 3.7 The pdf of a N(0, 1) random variable 

\[ P(W > 1.65) \approx 5\%, \quad P(W > 1.96) \approx 2.5\% \quad \text{and} \quad P(W > 2.32) \approx 1\%. \tag{3.2} \]
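
The three tail probabilities in (3.2) are easy to check numerically. The sketch below uses the identity P(W > x) = erfc(x/√2)/2, which relates the Gaussian tail to the complementary error function of Python's standard library; the function name is ours. 

```python
from math import erfc, sqrt


def gaussian_tail(x):
    """Return P(W > x) for a standard normal random variable W."""
    return 0.5 * erfc(x / sqrt(2))


for x in (1.65, 1.96, 2.32):
    print(f"P(W > {x}) = {gaussian_tail(x):.4f}")
```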


The Central Limit Theorem states that the sum of many small independent 
random variables is approximately Gaussian. This result explains why thermal noise, 
due to the agitation of many electrons, is Gaussian. Many other natural phenomena 
exhibit a Gaussian distribution when they are caused by a superposition of many 
independent effects. 


Theorem 3.1 (Central Limit Theorem) Let {X(n), n ≥ 1} be i.i.d. random 
variables with mean E(X(n)) = μ and variance var(X(n)) = σ². Then, as 
n → ∞, 

\[ \frac{X(1) + \cdots + X(n) - n\mu}{\sigma\sqrt{n}} \Rightarrow \mathcal{N}(0, 1). \tag{3.3} \]

In (3.3), the symbol ⇒ means convergence in distribution. Specifically, if 
{Y(n), n ≥ 1} are random variables, then Y(n) ⇒ N(0, 1) means that 

\[ P(Y(n) \le x) \to P(W \le x), \quad \forall x \in \mathbb{R}, \]

where W is a N(0, 1) random variable. We prove this result in the next chapter. 
More generally, one has the following definition. 


Definition 3.2 (Convergence in Distribution) Let {X(n), n ≥ 1} and X be 
random variables. One says that X(n) converges in distribution to X, and one writes 
X(n) ⇒ X, if 

\[ P(X(n) \le x) \to P(X \le x), \quad \text{for all } x \text{ s.t. } P(X = x) = 0. \tag{3.4} \]


As an example, let X(n) = 3 + 1/n for n ≥ 1 and X = 3. It is intuitively clear that 
the distribution of X(n) converges to that of X. However, 

\[ P(X(n) \le 3) = 0 \not\to P(X \le 3) = 1. \]

But, 

\[ P(X(n) \le x) \to P(X \le x), \quad \forall x \ne 3. \]

This example explains why the definition (3.4) requires convergence of P(X(n) ≤ 
x) to P(X ≤ x) only for x such that P(X = x) = 0. 

How does this notion of convergence relate to convergence in probability and 
almost sure convergence? First note that convergence in distribution is defined even 
if the random variables X(n) and X are not on the same probability space, since it 
involves only the distributions of the individual random variables. One can show² 
that 

\[ X(n) \xrightarrow{a.s.} X \ \text{ implies } \ X(n) \xrightarrow{p} X \ \text{ implies } \ X(n) \Rightarrow X. \]

Thus, convergence in distribution is the weakest form of convergence. 

Also, a fact that I find very comforting is that if X(n) ⇒ X, then one can 
construct random variables Y(n) and Y on the same probability space so that 
Y(n) =_D X(n) and Y =_D X and 

Y(n) → Y, with probability 1. 

This may seem mysterious but is in fact quite obvious. First note that a random 
variable with cdf F(·) can be constructed by choosing a random variable Z =_D 
U[0, 1] and defining (see Fig. 3.8) 

\[ X(Z) = \inf\{x \in \mathbb{R} \mid F(x) > Z\}. \]

Indeed, one then has P(X(Z) ≤ a) = F(a) since X(Z) ≤ a if and only if Z ∈ 
[0, F(a)], which has probability F(a) since Z =_D U[0, 1]. But then, if X(n) ⇒ X, 
we have F_{X_n}(x) → F_X(x) whenever P(X = x) = 0, and this implies that 

\[ X_n(z) = \inf\{x \in \mathbb{R} \mid F_{X_n}(x) > z\} \to X(z) = \inf\{x \in \mathbb{R} \mid F_X(x) > z\}, \]

for all z. 
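
The construction X(Z) = inf{x | F(x) > Z} is also a practical way to generate random variables. As an illustration, here is a minimal sketch for one particular cdf that we pick ourselves, F(x) = 1 − exp(−rate · x) for x ≥ 0 (an exponential distribution), for which the infimum has the closed form −ln(1 − Z)/rate. The function name, the rate, the seed, and the sample size are arbitrary choices. 

```python
import random
from math import log


def sample_from_uniform(rate, z):
    """Return X(z) = inf{x : F(x) > z} for F(x) = 1 - exp(-rate * x)."""
    return -log(1.0 - z) / rate


rng = random.Random(0)  # fixed seed so that the run is reproducible
rate = 2.0
samples = [sample_from_uniform(rate, rng.random()) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print("sample mean:", sample_mean, "expected about:", 1 / rate)
```

Feeding the same uniform variable Z to the cdfs of X(n) and of X is exactly how the random variables Y(n) and Y above end up on a common probability space. 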


² See Problems 2.5 and 3.9. 

Fig. 3.8 If Z =_D U[0, 1], then the cdf of X(Z) is F 

Fig. 3.9 Comparing Binomial(100, 0.2) with Gaussian(20, 16) 



3.2.1 Binomial and Gaussian 


Figure 3.9 compares the binomial and Gaussian distributions. 
To see why these distributions are similar, note that if X =_D B(N, p), then one 
can write 

\[ X = Y_1 + \cdots + Y_N, \]

where the random variables Y_n are i.i.d. and Bernoulli with parameter p. Thus, by 
the CLT, 

\[ \frac{X - Np}{\sqrt{N}} \approx \mathcal{N}(0, \sigma^2), \]

where σ² = var(Y_1) = E(Y_1²) − (E(Y_1))² = p(1 − p). Hence, one can argue that 

\[ B(N, p) \approx_D \mathcal{N}(Np, N\sigma^2) =_D \mathcal{N}(Np, Np(1-p)). \tag{3.5} \]

For p = 0.2 and N = 100, one concludes that B(100, 0.2) ≈ N(20, 16), which is 
confirmed by Fig. 3.9. 
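
The comparison can also be made numerically. The sketch below computes the largest gap between the cdf of B(100, 0.2) and the cdf of N(20, 16). Evaluating the Gaussian cdf at k + 0.5 instead of k (the usual continuity correction) is our choice and is not part of the argument above; the function names are ours as well. 

```python
from math import comb, erf, sqrt

N, p = 100, 0.2
mean = N * p
std = sqrt(N * p * (1 - p))


def binomial_cdf(k):
    """Return P(X <= k) for X distributed as B(N, p)."""
    return sum(comb(N, j) * p**j * (1 - p) ** (N - j) for j in range(k + 1))


def gaussian_cdf(x):
    """Return P(V <= x) for V distributed as N(mean, std^2)."""
    return 0.5 * (1 + erf((x - mean) / (std * sqrt(2))))


# Largest difference between the two cdfs over k = 0, ..., N.
max_gap = max(abs(binomial_cdf(k) - gaussian_cdf(k + 0.5)) for k in range(N + 1))
print("largest cdf gap:", max_gap)
```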
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3.2.2 Multiplexing and Gaussian 

We now apply the Gaussian approximation of a binomial distribution to multiplexing. 
Recall that we were looking for the smallest value of m such that P(B(N, p) > 
m) ≤ 5%. The ideas are as follows. From (3.5) and (3.2), we have 

(1) B(N, p) ≈ N(Np, Np(1 − p)), for N ≫ 1; 

(2) P(N(μ, σ²) > μ + 1.65σ) ≈ 5%. 

Combining these facts, we see that, for N ≫ 1, 

\[ P\left(B(N, p) > Np + 1.65\sqrt{Np(1-p)}\right) \approx 5\%. \]

Thus, the value of m that we are looking for is 

\[ m = Np + 1.65\sqrt{Np(1-p)} = 20 + 1.65\sqrt{16} \approx 27. \]

A look at Fig. 3.9 shows that it is indeed unlikely that ν is larger than 27 when 
ν =_D B(100, 0.2). 

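
As a small sketch of this provisioning rule, the function below (our own name and signature) evaluates Np + z√(Np(1 − p)), with z = 1.65 taken from (3.2) for a 5% tail. 

```python
from math import ceil, sqrt


def gaussian_provision(n_users, p_active, z=1.65):
    """Return Np + z * sqrt(Np(1 - p)), the Gaussian estimate of m."""
    mean = n_users * p_active
    std = sqrt(n_users * p_active * (1 - p_active))
    return mean + z * std


m_gauss = gaussian_provision(100, 0.2)
print("Gaussian estimate:", m_gauss, "rounded up:", ceil(m_gauss))
```

Rounding up gives the same value, 27, as the exact calculation (3.1). 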
3.2.3 Confidence Intervals 
One can invert the calculation that we did in the previous section and try to guess p 
from the observed fraction Y(N) of active users out of N ≫ 1. From the ideas (1) 
and (2) above, together with the symmetry of the Gaussian distribution around its 
mean, we see that the events 

\[ A_1 = \left\{B(N, p) \ge Np + 1.65\sqrt{Np(1-p)}\right\} \]

and 

\[ A_2 = \left\{B(N, p) \le Np - 1.65\sqrt{Np(1-p)}\right\} \]

each have a probability close to 5%. With Y(N) =_D B(N, p)/N, we see that 

\[ A_1 = \left\{Y(N) \ge p + 1.65\sqrt{\frac{p(1-p)}{N}}\right\} \]

and 

\[ A_2 = \left\{Y(N) \le p - 1.65\sqrt{\frac{p(1-p)}{N}}\right\}. \]

Hence, the event A_1 ∪ A_2 has probability close to 10%, so that its complement has 
probability close to 90%. Consequently, 

\[ P\left(Y(N) - 1.65\sqrt{\frac{p(1-p)}{N}} \le p \le Y(N) + 1.65\sqrt{\frac{p(1-p)}{N}}\right) \approx 90\%. \]

We do not know p, but p(1 − p) ≤ 1/4. Hence, we find 

\[ P\left(Y(N) - 0.83\frac{1}{\sqrt{N}} \le p \le Y(N) + 0.83\frac{1}{\sqrt{N}}\right) \ge 90\%. \]

For N = 100, this gives 

\[ P(Y(N) - 0.08 \le p \le Y(N) + 0.08) \ge 90\%. \]

For instance, if we observe that 30% of the 100 users are active, then we guess 
that p is between 0.22 and 0.38, with probability 90%. In other words, [Y(N) − 
0.08, Y(N) + 0.08] is a 90%-confidence interval for p. 

Figure 3.7 shows that we can get a 95%-confidence interval by replacing 1.65 by 
2. Thus, we see that 

\[ \left[Y(N) - \frac{1}{\sqrt{N}},\ Y(N) + \frac{1}{\sqrt{N}}\right] \tag{3.6} \]

is a 95%-confidence interval for p. 

How large should N be to have a good estimate of p? Let us say that we would 
like to know p plus or minus 0.03 with 95% confidence. Using (3.6), we see that we 
need 

\[ \frac{1}{\sqrt{N}} \approx 3\%, \quad \text{i.e.,} \quad N \approx 1{,}089 = 33^2. \]

Thus, Y(1,089) is an estimate of p with an error less than about 0.03, with probability 
95%. Such results form the basis for the design of public opinion surveys. 

In many cases, one does not know a bound on the variance. In such situations, 
one replaces the standard deviation by the sample standard deviation. That is, for 
i.i.d. random variables {X(n), n ≥ 1} with mean μ, the confidence intervals for μ 
are as follows: 

\[ \left[\mu_n - 1.65\frac{\sigma_n}{\sqrt{n}},\ \mu_n + 1.65\frac{\sigma_n}{\sqrt{n}}\right] = 90\%\text{-Confidence Interval} \]

\[ \left[\mu_n - 2\frac{\sigma_n}{\sqrt{n}},\ \mu_n + 2\frac{\sigma_n}{\sqrt{n}}\right] = 95\%\text{-Confidence Interval,} \]

where 

\[ \mu_n = \frac{X(1) + \cdots + X(n)}{n} \]

and 

\[ \sigma_n^2 = \frac{\sum_{m=1}^n (X(m) - \mu_n)^2}{n-1} = \frac{n}{n-1}\left[\frac{1}{n}\sum_{m=1}^n X(m)^2 - \mu_n^2\right]. \]


What's up with this n − 1 denominator? You probably expected the sample 
variance to be the arithmetic mean of the squares of the deviations from the sample 
mean, i.e., a denominator n in the first expression for σ_n². It turns out that to make the 
estimator such that E(σ_n²) = σ², i.e., to make the estimator unbiased, one should 
divide by n − 1 instead of n. The difference is negligible for large n, obviously. 
Nevertheless, let us see why this is so. 

For simplicity of notation, assume that E(X(n)) = 0 and let σ² = var(X(n)) = 
E(X(n)²). Note that 

\[
\begin{aligned}
n^2 E((X(1) - \mu_n)^2) &= E((nX(1) - X(1) - X(2) - \cdots - X(n))^2) \\
&= E((n-1)^2 X(1)^2) + E(X(2)^2) + \cdots + E(X(n)^2) \\
&= (n-1)^2\sigma^2 + (n-1)\sigma^2 = n(n-1)\sigma^2.
\end{aligned}
\]

For the second equality, note that the cross-terms E(X(i)X(j)) for i ≠ j vanish 
because the random variables are independent and zero-mean. 
Hence, 

\[ E((X(1) - \mu_n)^2) = \frac{n-1}{n}\sigma^2 \quad \text{and} \quad \sum_{m=1}^n E((X(m) - \mu_n)^2) = (n-1)\sigma^2. \]

Consequently, an unbiased estimate of σ² is 

\[ \sigma_n^2 := \frac{1}{n-1}\sum_{m=1}^n (X(m) - \mu_n)^2. \]
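
The recipe for these confidence intervals can be sketched as follows. The function name and the sample values are made up for illustration, and the sample is deliberately tiny, whereas the Gaussian approximation behind the interval requires n ≫ 1. 

```python
from math import sqrt


def confidence_interval(samples, z=2.0):
    """Return (low, high) = mu_n -/+ z * sigma_n / sqrt(n).

    z = 2 gives the 95% interval and z = 1.65 the 90% interval.
    The sample variance uses the unbiased n - 1 denominator.
    """
    n = len(samples)
    mu_n = sum(samples) / n
    var_n = sum((x - mu_n) ** 2 for x in samples) / (n - 1)
    half_width = z * sqrt(var_n) / sqrt(n)
    return mu_n - half_width, mu_n + half_width


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
low, high = confidence_interval(data)
print("95% confidence interval:", (low, high))
```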


3.3 Buffers 


The internet is a packet-switched network. A packet is a group of bits of data 
together with some control information such as a source and destination address, 
somewhat like an envelope you send by regular mail (if you remember that). A host 
(e.g., a computer, a smartphone, or a web cam) sends packets to a switch. The switch 
has multiple input and output ports, as shown in Fig. 3.10. 

Fig. 3.10 A switch with multiple input and output ports 

The switch stores the packets as they arrive and sends them out on the appropriate 
output port, based on the destination address of the packets. The packets arrive at 
random times at the switch and, occasionally, packets that must go out on a specific 
output port arrive faster than the switch can send them out. When this happens, 
packets accumulate in a buffer. Consequently, packets may face a queueing³ delay 
before they leave the switch. We study a simple model of such a system. 


3.3.1 Markov Chain Model of Buffer 


We focus on packets destined to one particular output port. Our model is in 
discrete time. We assume that one packet destined for that output port arrives with 
probability λ ∈ [0, 1] at each time instant, independently of previous arrivals. The 
packets have random sizes, so that they take random times to be transmitted. We 
assume that the time to transmit a packet is geometrically distributed with parameter 
μ and all the transmission times are independent. Let X_n be the number of packets 
in the output buffer at time n, for n ≥ 0. At time n, a transmission completes with 
probability μ and a new packet arrives with probability λ, independently of the past. 
Thus, X_n is a Markov chain with the state transition diagram shown in Fig. 3.11. 


³ Queueing and queuing are alternative spellings; queueing tends to be preferred by researchers and 
has the peculiar feature of having five vowels in a row, somewhat appropriately. 

Fig. 3.11 The transition probabilities for the buffer occupancy for one of the output ports 
(states 0, 1, ..., n − 1, n, n + 1, ...; up with probability p2, down with probability p0) 


In this diagram, 

\[ p_2 = \lambda(1 - \mu), \qquad p_0 = \mu(1 - \lambda), \qquad p_1 = 1 - p_0 - p_2. \]

For instance, p2 is the probability that one new packet arrives and that the 
transmission of a previous packet does not complete, so that the number of packets 
in the buffer increases by one. 


3.3.2 Invariant Distribution 
The balance equations are 

\[
\begin{aligned}
\pi(0) &= (1 - p_2)\pi(0) + p_0\pi(1) \\
\pi(n) &= p_2\pi(n-1) + p_1\pi(n) + p_0\pi(n+1), \quad 1 \le n \le N-1 \\
\pi(N) &= p_2\pi(N-1) + (1 - p_0)\pi(N).
\end{aligned}
\]

You can verify that the solution is given by 

\[ \pi(i) = \pi(0)\rho^i, \quad i = 0, 1, \ldots, N \quad \text{where } \rho := \frac{p_2}{p_0}. \]

Since the probabilities add up to one, we find that 

\[ \pi(0) = \left[\sum_{i=0}^N \rho^i\right]^{-1} = \frac{1 - \rho}{1 - \rho^{N+1}}. \]

In particular, the average value of X under the invariant distribution is 

\[
\begin{aligned}
E(X) &= \sum_{i=0}^N i\pi(i) = \pi(0)\sum_{i=0}^N i\rho^i
= \rho\,\frac{N\rho^{N+1} - (N+1)\rho^N + 1}{(1-\rho)(1-\rho^{N+1})} \\
&\approx \frac{\rho}{1-\rho} = \frac{p_2}{p_0 - p_2} = \frac{\lambda(1-\mu)}{\mu - \lambda},
\end{aligned}
\]

where the approximation is valid if ρ < 1, i.e., λ < μ, and N ≫ 1 so that Nρ^N ≪ 1. 

Fig. 3.12 A simulation of the queue with λ = 0.16, μ = 0.20, and N = 20 (curves: 
buffer backlog X[n], average backlog during {0, 1, ..., n}, and expected backlog E[X]) 

Figure 3.12 shows a simulation of this queue when λ = 0.16, μ = 0.20, and 
N = 20. It also shows the average queue length over n steps and we see that it 
approaches λ(1 − μ)/(μ − λ) = 3.2. Note that this queue is almost never full, 
which explains why one can let N → ∞ in the expression for E(X). 
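
The invariant distribution is simple to evaluate numerically. The sketch below uses the parameters of Fig. 3.12; the variable names are ours. It builds π from the geometric form π(i) = π(0)ρ^i, checks the interior balance equations, and compares the exact mean backlog for N = 20 with the approximation λ(1 − μ)/(μ − λ). 

```python
lam, mu, N = 0.16, 0.20, 20  # parameters of Fig. 3.12

p2 = lam * (1 - mu)  # backlog increases by one
p0 = mu * (1 - lam)  # backlog decreases by one
p1 = 1 - p0 - p2     # backlog is unchanged
rho = p2 / p0

# Invariant distribution: geometric weights normalized to sum to one.
weights = [rho**i for i in range(N + 1)]
total = sum(weights)
pi = [w / total for w in weights]

# Largest violation of the balance equations for 1 <= n <= N - 1.
residual = max(
    abs(pi[n] - (p2 * pi[n - 1] + p1 * pi[n] + p0 * pi[n + 1]))
    for n in range(1, N)
)

mean_backlog = sum(i * prob for i, prob in enumerate(pi))
approximation = lam * (1 - mu) / (mu - lam)
print("exact E(X) for N = 20:", mean_backlog)
print("approximation for large N:", approximation)
print("probability that the buffer is full:", pi[N])
```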


3.3.3 Average Delay 


How long do packets stay in the switch? Consider a packet that arrives when there 
are k packets already in the buffer. That packet then leaves after k + 1 packet 
transmissions. Since each packet transmission takes 1/μ steps, on average, the 
expected time that the packet spends in the switch is (k + 1)/μ. Thus, to find the 
expected time a packet stays in the switch, we need to calculate the probability φ(k) 
that an arriving packet finds k packets already in the buffer. Then, the expected time 
W that a packet stays in the switch is given by 

\[ W = \sum_{k \ge 0} \phi(k)\frac{k+1}{\mu}. \]

The result of the calculation is given in the next theorem. 

Theorem 3.2 If λ < μ, one has 

\[ W = \frac{1}{\lambda}E(X) = \frac{1-\mu}{\mu-\lambda}. \]


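Proof The calculation is a bit lengthy and the details may not be that interesting, 
except that they explain how to calculate φ(k) and that they show that the simplicity 
of the result is quite remarkable. 

Recall that φ(k) is the probability that there are k + 1 packets in the buffer after a 
given packet arrives at time n. Thus, φ(k) = P[X(n+1) = k+1 | A(n) = 1] where 
A(n) is the number of arrivals at time n. Now, if D(n) is the number of transmission 
completions at time n, 

\[
\begin{aligned}
\phi(k) = {} & P[X(n) = k+1, D(n) = 1 \mid A(n) = 1] \\
& + P[X(n) = k, D(n) = 0 \mid A(n) = 1].
\end{aligned}
\]

Also, 

\[
\begin{aligned}
P[X(n) = k+1, D(n) = 1 \mid A(n) = 1]
&= \frac{P[X(n) = k+1, D(n) = 1, A(n) = 1]}{P(A(n) = 1)} \\
&= \frac{1}{\lambda}P(X(n) = k+1)\,P[D(n) = 1, A(n) = 1 \mid X(n) = k+1] \\
&= \frac{1}{\lambda}\pi(k+1)\lambda\mu = \pi(k+1)\mu.
\end{aligned}
\]

Similarly, 

\[
\begin{aligned}
P[X(n) = k, D(n) = 0 \mid A(n) = 1]
&= \frac{P[X(n) = k, D(n) = 0, A(n) = 1]}{P(A(n) = 1)} \\
&= \frac{1}{\lambda}P(X(n) = k)\,P[D(n) = 0, A(n) = 1 \mid X(n) = k] \\
&= \frac{1}{\lambda}\pi(k)\lambda(1 - \mu 1\{k \ge 0\}) = \pi(k)(1 - \mu 1\{k \ge 0\}).
\end{aligned}
\]

Hence, 

\[ \phi(k) = \pi(k)(1 - \mu 1\{k \ge 0\}) + \pi(k+1)\mu. \]

Consequently, the expected time W that a packet spends in the switch is given by 

\[
\begin{aligned}
W &= \sum_{k \ge 0} \phi(k)\frac{k+1}{\mu}
= \frac{1}{\mu}\sum_{k \ge 0}(k+1)\pi(k)(1 - \mu 1\{k \ge 0\}) + \sum_{k \ge 0}(k+1)\pi(k+1) \\
&= \frac{1-\mu}{\mu}\sum_{k \ge 0}(k+1)\pi(k) + \sum_{k \ge 1}k\pi(k) \\
&= \frac{1}{\mu} + \frac{1-\mu}{\mu}E(X) + E(X) - 1 = \frac{1}{\mu} + \frac{1}{\mu}E(X) - 1 \\
&= \frac{1}{\mu} + \frac{\lambda(1-\mu)}{\mu(\mu-\lambda)} - 1 = \frac{1-\mu}{\mu-\lambda} = \frac{1}{\lambda}E(X).
\end{aligned}
\]
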

3.3.4 A Note About Arrivals 


Since the arrivals are independent of the backlog in the buffer, it is tempting to 
conclude that the probability that a packet finds k packets in the buffer upon its arrival 
is π(k). An argument in favor of this conclusion looks as follows: 

\[
\begin{aligned}
P[X_{n+1} = k+1 \mid A_n = 1] &= P[X_n = k \mid A_n = 1] \\
&= P[X_n = k] = \pi(k),
\end{aligned}
\]

where the second identity comes from the independence of the arrivals A_n and the 
backlog X_n. However, the first identity does not hold since it is possible that X_{n+1} = 
k, X_n = k, and A_n = 1. Indeed, one may have D_n = 1. 

If one assumes that λ < μ ≪ 1, then the probability that A_n = 1 and D_n = 1 
is negligible and it is then the case that φ(k) ≈ π(k). We encounter that situation in 
Sect. 5.6. 

3.3.5 Little's Law 
The previous result is a particular case of Little’s Law (Little 1961) (Fig. 3.13). 
Theorem 3.3 (Little’s Law) Under weak assumptions, 

\[ L = \lambda W, \]

where L is the average number of customers in a system, λ is the average arrival 
rate of customers, and W is the average time that a customer spends in the system. 

Fig. 3.13 John D. C. Little. b. 1928 

One way to understand this law is to consider a packet that leaves the switch 
after having spent T time units. During its stay, λT packets arrive, on average. So 
the average backlog in the switch should be λT. 

It turns out that Little's law applies to very general systems, even those that do 
not serve the packets in their order of arrival. 

One way to see this is to think that each packet pays the switch one unit of money 
per unit of time it spends in the switch. If a packet spends T time units, on average, 
in the switch, then each packet pays T, on average. Thus, the switch collects money 
at the rate of λT per unit of time, since λ packets go through the switch per unit of 
time and each pays an average of T. Another way to look at the rate at which the 
switch is getting paid is to realize that if there are L packets in the switch at any 
given time, on average, then the switch collects money at rate L, since each packet 
pays one unit per unit time. Thus, L = λT. 
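
As a quick numerical check, here is the law applied to the buffer of Sect. 3.3.2 with the values of Fig. 3.12 (λ = 0.16, μ = 0.20), taking N large enough for the approximation of E(X) to apply: 

```latex
\[
L = E(X) \approx \frac{\lambda(1-\mu)}{\mu-\lambda}
  = \frac{0.16 \times 0.80}{0.04} = 3.2,
\qquad
W = \frac{L}{\lambda} = \frac{3.2}{0.16} = 20
  = \frac{1-\mu}{\mu-\lambda} = \frac{0.80}{0.04}.
\]
```

The two expressions for W agree, as Theorem 3.2 states. 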


3.4 Multiple Access 


Imagine a number of smartphones sharing a WiFi access point, as illustrated in 
Fig. 3.14. They want to transmit packets. 

If multiple smartphones transmit at the same time, the transmissions garble one 
another, and we say that they collide. We discuss a simple scheme to regulate the 
transmissions and achieve a large rate of success. We consider a discrete time model 
of the situation. 

There are N devices. At time n ≥ 0, each device transmits with probability p, 
independently of the others. This scheme, called randomized multiple access, was 
proposed by Norman Abramson in the late 1960s for his Aloha network (Abramson 
1970) (Fig. 3.15). 

The number X(n) of transmissions at time n is then B(N, p) (see (B.4)). In 
particular, the fraction of time that exactly one device transmits is 

\[ P(X(n) = 1) = Np(1-p)^{N-1}. \]

The maximum over p of this success rate occurs for p = 1/N and it is λ* where 

\[ \lambda^* = \left(1 - \frac{1}{N}\right)^{N-1} \approx \frac{1}{e} \approx 0.36. \]

In this derivation, we use the fact that 

\[ \left(1 - \frac{a}{N}\right)^N \approx e^{-a} \quad \text{for } N \gg 1. \tag{3.7} \]

Fig. 3.14 A number of smartphones share a WiFi access point 

Fig. 3.15 Norman Abramson, b. 1932 

Thus, this scheme achieves a transmission rate of about 36%. However, it 
requires selecting p = 1/N, which means that the devices need to know how many 
other devices are active (i.e., try to transmit). We discuss an adaptive scheme in the 
next chapter that does not require that information. 
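
The claim that p = 1/N maximizes the success rate can be checked with a brute-force search. This is a minimal sketch: N = 40 and the grid of candidate values for p are arbitrary choices of ours, as is the function name. 

```python
from math import exp


def success_probability(n_devices, p):
    """Return P(exactly one of n_devices transmits) = N p (1 - p)^(N - 1)."""
    return n_devices * p * (1 - p) ** (n_devices - 1)


N = 40
grid = [k / 10_000 for k in range(1, 10_000)]  # candidate values of p
best_p = max(grid, key=lambda p: success_probability(N, p))

print("best p on the grid:", best_p, "and 1/N:", 1 / N)
print("success rate at p = 1/N:", success_probability(N, 1 / N))
print("1/e:", exp(-1))
```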


3.5 Summary 


• Gaussian random variable N(μ, σ²); 
• CLT; 
• Confidence Intervals; 
• Buffers: average backlog and delay; Little's Law; 
• Multiple Access Protocol. 

3.5.1 Key Equations and Formulas 


Definition of N(μ, σ²)       f_X(x) = (2πσ²)^{−1/2} exp{−(x − μ)²/(2σ²)}    D.3.1 
CLT                          (X_1 + ··· + X_n − nμ)/√n ⇒ N(0, σ²)           T.3.1 
95%-Confidence Interval      (X_1 + ··· + X_n)/n ± 2σ/√n                    S.3.2.3 
Little’s Law                 L = λW                                          T.3.3 
Exponential Approximation    (1 − a/n)^n ≈ exp{−a}                           (3.7) 


3.6 References 


The buffering analysis is a simple example of queueing theory. See Kleinrock 
(1975-6) for a discussion of queueing models of computer and communication 
systems. 


3.7 Problems 


Problem 3.1 Write a Python code to compute the number of people to poll in a 
public opinion survey to estimate the fraction of the population that will vote in 
favor of a proposition within α percent, with probability at least 1 − β. Use an upper 
bound on the variance. Assume that we know that p ∈ [0.4, 0.7]. 


Problem 3.2 We are conducting a public opinion poll to determine the fraction p 
of people who will vote for Mr. Whatshisname as the next president. We ask N_1 
college-educated and N_2 non-college-educated people. We assume that the votes 
in each of the two groups are i.i.d. B(p_1) and B(p_2), respectively, in favor of 
Whatshisname. In the general population, the percentage of college-educated people 
is known to be q. 


(a) What is a 95%-confidence interval for p, using an upper bound for the variance? 
(b) How do we choose N_1 and N_2 subject to N_1 + N_2 = N to minimize the width 
of that interval? 


Problem 3.3 You flip a fair coin 10,000 times. The probability that there are more 
than 5085 heads is approximately (choose the correct answer) 


15%; 
10%; 
5%; 
2.5%; 
1%. 

Problem 3.4 Write a Python simulation of a buffer where packets arrive as a 
Bernoulli process with rate λ and geometric service times with rate μ. Plot the 
simulation and calculate the long-term average backlog. 


Problem 3.5 Consider a buffer that can transmit up to M packets in parallel. That 
is, when there are m packets in the buffer, min{m, M} of these packets are being 
transmitted. Also, each of these packets completes transmission independently in 
the next time slot with probability μ. At each time step, a packet arrives with 
probability λ. 


(a) What are the transition probabilities of the corresponding Markov chain? 
(b) For what values of λ, M, and μ do you expect the system to be stable? 
(c) Write a Python simulation of this system. 


Problem 3.6 In order to estimate the probability of head in a coin flip, p, you flip a 
coin n times, and count the number of heads, S_n. You use the estimator p̂ = S_n/n. 
You choose the sample size n to have a guarantee 

\[ P(|S_n/n - p| \ge \epsilon) \le \delta. \]

(a) What is the value of n suggested by Chebyshev’s inequality? (Use a bound on 
the variance.) 

(b) How does this value change when ε is reduced to half of its original value? 

(c) How does it change when δ is reduced to half of its original value? 

(d) Compare this value of n with that given by the CLT. 


Problem 3.7 Let {X_n, n ≥ 1} be i.i.d. U[0, 1] and Z_n = X_1 + ··· + X_n. What is 
P(Z_n > n)? What would the estimate be of the same probability obtained from the 
Central Limit Theorem? 


Problem 3.8 Consider one buffer where packets arrive one by one every 2 s and 
take 1 s to transmit. What is the average delay through the queue per packet? Repeat 
the problem assuming that the packets arrive ten at a time every 20 s. This example 
shows that the delay depends on how “bursty” the traffic is. 


Problem 3.9 Show that if X(n) → X in probability, then X(n) ⇒ X. 


Hint Assume that P(X = x) = 0. To show that P(X(n) ≤ x) → P(X ≤ x), note 
that if |X(n) − X| ≤ ε and X ≤ x, then X(n) ≤ x + ε. 
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Multiplexing: B 4 


Topics: Characteristic Functions, Proof of Central Limit Theorem, Adaptive 
CSMA 


4.1 Characteristic Functions 


Before we explain the proof of the CLT, we have to describe the use of characteristic 
functions. 


Definition 4.1 (Characteristic Function) The characteristic function of a random 
variable X is defined as 

\[ \phi_X(u) = E(e^{iuX}), \quad u \in \mathbb{R}. \]

In this expression, i := √(−1). 


Note that 

\[ \phi_X(u) = \int_{-\infty}^{\infty} e^{iux} f_X(x)\,dx, \]

so that φ_X(u) is the Fourier transform of f_X(x). As such, the characteristic function 
determines the pdf uniquely. 
As an important example, we have the following result. 


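Theorem 4.1 (Characteristic Function of N(0, 1)) Let X =_D N(0, 1). Then, 

\[ \phi_X(u) = e^{-u^2/2}. \tag{4.1} \]

Proof One has 

\[ \phi_X(u) = \int_{-\infty}^{\infty} e^{iux}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx, \]

so that 

\[
\begin{aligned}
\frac{d}{du}\phi_X(u)
&= \int_{-\infty}^{\infty} ix\,e^{iux}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx
= -\int_{-\infty}^{\infty} i\,e^{iux}\frac{1}{\sqrt{2\pi}}\,d\!\left(e^{-x^2/2}\right) \\
&= \int_{-\infty}^{\infty} i\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,d\!\left(e^{iux}\right)
= -u\int_{-\infty}^{\infty} e^{iux}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx \\
&= -u\,\phi_X(u).
\end{aligned}
\]

(The third equation follows by integration by parts.) Thus, 

\[ \frac{d}{du}\log(\phi_X(u)) = -u = \frac{d}{du}\left(-\frac{u^2}{2}\right), \]

which implies that 

\[ \phi_X(u) = A e^{-u^2/2}. \]

Since φ_X(0) = E(e^{i0X}) = 1, we see that A = 1, and this proves the result (4.1). 


We are now ready to prove the CLT. 


4.2 Proof of CLT (Sketch) 
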


The technique to analyze sums of independent random variables is to calculate the 
characteristic function. Let then 

\[ Y(n) = \frac{X(1) + \cdots + X(n) - n\mu}{\sigma\sqrt{n}}. \]

We have 

\[
\begin{aligned}
\phi_{Y(n)}(u) &= E\left(e^{iuY(n)}\right)
= E\left(\prod_{m=1}^n \exp\left\{\frac{iu(X(m) - \mu)}{\sigma\sqrt{n}}\right\}\right) \\
&= \left[E\left(\exp\left\{\frac{iu(X(1) - \mu)}{\sigma\sqrt{n}}\right\}\right)\right]^n \\
&= \left[E\left(1 + \frac{iu(X(1) - \mu)}{\sigma\sqrt{n}} - \frac{u^2(X(1) - \mu)^2}{2\sigma^2 n} + o(1/n)\right)\right]^n \\
&= \left[1 - \frac{u^2}{2n} + o(1/n)\right]^n \to \exp\left\{-\frac{u^2}{2}\right\}, \quad \text{as } n \to \infty.
\end{aligned}
\]

The third equality holds because the X(m) are i.i.d. and the fourth one follows from 
the Taylor expansion of the exponential: 

\[ e^a \approx 1 + a + \frac{1}{2}a^2. \]

Thus, the characteristic function of Y(n) converges to that of a N(0, 1) random 
variable. This suggests that the inverse Fourier transform, i.e., the density of Y(n), 
converges to that of a N(0, 1) random variable. This last step can be shown 
formally, but we will not do it here. 


4.3 Moments of N(0, 1) 


We can use the characteristic function of a N(0, 1) random variable X to calculate 
its moments. This is how. First we note that, by using the Taylor expansion of the 
exponential, 

\[ \phi_X(u) = E\left(e^{iuX}\right) = E\left(\sum_{n=0}^{\infty}\frac{1}{n!}(iuX)^n\right) = \sum_{n=0}^{\infty}\frac{1}{n!}(iu)^n E(X^n). \]

Second, again using the expansion of the exponential, 

\[ \phi_X(u) = e^{-u^2/2} = \sum_{m=0}^{\infty}\frac{1}{m!}\left(-\frac{u^2}{2}\right)^m. \]

Third, we match the coefficients of u^{2m} in these two expressions and we find that 

\[ \frac{1}{(2m)!}\,i^{2m}E(X^{2m}) = \frac{1}{m!}\left(-\frac{1}{2}\right)^m. \]

This gives¹ 

\[ E(X^{2m}) = \frac{(2m)!}{m!\,2^m}. \tag{4.2} \]

For instance, 

\[ E(X^2) = \frac{2!}{1!\,2^1} = 1, \qquad E(X^4) = \frac{4!}{2!\,2^2} = 3. \]

Finally, we note that the coefficients of odd powers of u must be zero, so that 

\[ E(X^{2m+1}) = 0, \quad \text{for } m = 0, 1, 2, \ldots. \]

(This should be obvious from the symmetry of f_X(x).) In particular, 

\[ \mathrm{var}(X) = E(X^2) - E(X)^2 = 1. \]
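
Formula (4.2) is easy to tabulate. The short sketch below (the function name is ours) evaluates it with exact integer arithmetic for the first few even moments. 

```python
from math import factorial


def even_moment(m):
    """Return E(X^(2m)) = (2m)! / (m! 2^m) for X distributed as N(0, 1)."""
    return factorial(2 * m) // (factorial(m) * 2**m)


moments = [even_moment(m) for m in range(1, 5)]
print(moments)  # [1, 3, 15, 105]
```

The integer division is exact because (2m)!/(m! 2^m) equals the product 1 · 3 · 5 ··· (2m − 1). 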


4.4 Sum of Squares of 2 i.i.d. N(0, 1) 

Let X, Y be two i.i.d. N(0, 1) random variables. The claim is that 

\[ Z = X^2 + Y^2 =_D \mathrm{Exp}(1/2). \]

Let θ be the angle of the vector (X, Y) and R² = X² + Y². Thus (see Fig. 4.1) 

\[ dx\,dy = r\,dr\,d\theta. \]

Note that E(Z) = E(X²) + E(Y²) = 2, so that if Z is exponentially distributed, 
its rate must be 1/2. Let us prove that it is exponential. One has 

\[
\begin{aligned}
f_{X,Y}(x, y)\,dx\,dy &= f_{X,Y}(x, y)\,r\,dr\,d\theta
= \frac{1}{2\pi}\exp\left\{-\frac{x^2 + y^2}{2}\right\} r\,dr\,d\theta \\
&= \frac{d\theta}{2\pi} \times r\exp\left\{-\frac{r^2}{2}\right\}dr
=: f_\theta(\theta)\,d\theta \times f_R(r)\,dr,
\end{aligned}
\]

where 

\[ f_\theta(\theta) = \frac{1}{2\pi}1\{0 < \theta < 2\pi\} \quad \text{and} \quad f_R(r) = r\exp\left\{-\frac{r^2}{2}\right\}1\{r \ge 0\}. \]

Thus, the angle θ of (X, Y) and the norm R = √(X² + Y²) are independent and have 
the indicated distributions. But then, if V = R² =: g(R), we find that, for v ≥ 0 and 
r = √v, 

\[ f_V(v) = \frac{1}{|g'(r)|}f_R(r) = \frac{1}{2r}\,r\exp\left\{-\frac{r^2}{2}\right\} = \frac{1}{2}\exp\left\{-\frac{v}{2}\right\}, \]

which shows that the angle θ and V = X² + Y² are independent, the former being 
uniformly distributed in [0, 2π] and the latter being exponentially distributed with 
mean 2. 

Fig. 4.1 Under the change of variables x = r cos(θ) and y = r sin(θ), we see that 
dxdy = r dr dθ. That is, [r, r + dr] × [θ, θ + dθ] covers an area r dr dθ in the 
(x, y) plane 

¹ We used the fact that i^{2m} = (−1)^m. 



4.5 Two Applications of Characteristic Functions 

We have used characteristic functions to prove the CLT. Here are two other cute 
applications. 

4.5.1 Poisson as a Limit of Binomial 


A Poisson random variable X with mean λ can be viewed as a limit of a B(n, λ/n) 
random variable X_n as n → ∞. To see this, note that 

\[ E(\exp\{iuX_n\}) = E(\exp\{iu(Z_n(1) + \cdots + Z_n(n))\}), \]

where the random variables {Z_n(1), ..., Z_n(n)} are i.i.d. Bernoulli with mean λ/n. 
Hence, 

\[ E(\exp\{iuX_n\}) = \left[E(\exp\{iuZ_n(1)\})\right]^n = \left[1 + \frac{\lambda}{n}\left(e^{iu} - 1\right)\right]^n. \]

For the second identity, we use the fact that if Z =_D B(p), then 

\[ E(\exp\{iuZ\}) = (1 - p)e^0 + pe^{iu} = 1 + p\left(e^{iu} - 1\right). \]

Also, since 

\[ P(X = m) = \frac{\lambda^m}{m!}e^{-\lambda} \quad \text{and} \quad e^a = \sum_{m=0}^{\infty}\frac{a^m}{m!}, \]

we find that 

\[ E(\exp\{iuX\}) = \sum_{m=0}^{\infty}\frac{\lambda^m}{m!}\exp\{-\lambda\}e^{ium} = \exp\{\lambda(e^{iu} - 1)\}. \]

The result then follows from the fact that 

\[ \left(1 + \frac{a}{n}\right)^n \to e^a, \quad \text{as } n \to \infty. \]
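
The convergence can also be observed directly on the probability mass functions. The sketch below compares B(n, λ/n) with the Poisson distribution for increasing n. The choice λ = 2, the restriction to the values m = 0, ..., 10, and the function names are ours. 

```python
from math import comb, exp, factorial


def binomial_pmf(m, n, p):
    """Return P(X_n = m) for X_n distributed as B(n, p)."""
    return comb(n, m) * p**m * (1 - p) ** (n - m)


def poisson_pmf(m, lam):
    """Return P(X = m) for a Poisson random variable X with mean lam."""
    return lam**m / factorial(m) * exp(-lam)


lam = 2.0
gaps = {}
for n in (10, 100, 1000):
    # Largest difference between the two pmfs over m = 0, ..., 10.
    gaps[n] = max(
        abs(binomial_pmf(m, n, lam / n) - poisson_pmf(m, lam)) for m in range(11)
    )
    print(f"n = {n}: largest pmf gap = {gaps[n]:.5f}")
```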


4.5.2 Exponential as Limit of Geometric 


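An exponential random variable can be viewed as a limit of scaled geometric 
random variables. Let X =_D Exp(λ) and X_n =_D G(λ/n). Then 

\[ \frac{1}{n}X_n \to X, \quad \text{in distribution.} \]

To see this, recall that 

\[ f_X(x) = \lambda e^{-\lambda x}1\{x \ge 0\}. \]

Also, 

\[ \int_0^{\infty} e^{-\beta x}\,dx = \frac{1}{\beta} \]

if the real part of β is positive. 
Hence, 

\[ E(\exp\{iuX\}) = \int_0^{\infty} e^{iux}\lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - iu}. \]

Moreover, since 

\[ P(X_n = m) = (1 - p)^m p, \quad m \ge 0, \]

we find that, with p = λ/n, 

\[
\begin{aligned}
E\left(\exp\left\{iu\frac{1}{n}X_n\right\}\right)
&= \sum_{m=0}^{\infty}(1 - p)^m p\exp\{ium/n\} \\
&= p\sum_{m=0}^{\infty}\left[(1 - p)\exp\{iu/n\}\right]^m = \frac{p}{1 - (1 - p)\exp\{iu/n\}} \\
&= \frac{\lambda/n}{1 - (1 - \lambda/n)\exp\{iu/n\}} \\
&= \frac{\lambda}{n\left(1 - (1 - \lambda/n)(1 + iu/n + o(1/n))\right)} \\
&= \frac{\lambda}{\lambda - iu + n\,o(1/n)},
\end{aligned}
\]

where n o(1/n) → 0 as n → ∞. This proves the result. 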


4.6 Error Function 
In the calculation of confidence intervals, one uses estimates of 

\[ Q(x) := P(X > x) \quad \text{where } X =_D \mathcal{N}(0, 1). \]

The function Q(x) is called the error function. With Python or the appropriate 
smartphone app, you can get the value of Q(x). Nevertheless, the following bounds 
(see Fig. 4.2) may be useful. 


Theorem 4.2 (Bounds on Error Function) One has 

\[ \frac{x}{1 + x^2}\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\} \le Q(x) \le \frac{1}{x\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\}, \quad \forall x > 0. \]

Proof Here is a derivation of the upper bound. For x > 0, one has 

\[
\begin{aligned}
Q(x) &= \int_x^{\infty} f_X(y)\,dy = \int_x^{\infty}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy
\le \int_x^{\infty}\frac{y}{x}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy \\
&= -\frac{1}{x\sqrt{2\pi}}\int_x^{\infty} d\!\left(e^{-y^2/2}\right)
= \frac{1}{x\sqrt{2\pi}}e^{-x^2/2}.
\end{aligned}
\]

Fig. 4.2 The error function Q(x) and its bounds 

For the lower bound, one uses the following calculation, again with x > 0: 

\[
\begin{aligned}
\left(1 + \frac{1}{x^2}\right)\int_x^{\infty} e^{-y^2/2}\,dy
&\ge \int_x^{\infty}\left(1 + \frac{1}{y^2}\right)e^{-y^2/2}\,dy \\
&= -\int_x^{\infty} d\!\left(\frac{1}{y}e^{-y^2/2}\right) = \frac{1}{x}e^{-x^2/2}.
\end{aligned}
\]
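
The bounds of Theorem 4.2 can be compared with the exact value of Q(x). In the sketch below, Q(x) is evaluated through the complementary error function of Python's standard library; the function names and the test points are ours. 

```python
from math import erfc, exp, pi, sqrt


def q_function(x):
    """Return Q(x) = P(X > x) for X distributed as N(0, 1)."""
    return 0.5 * erfc(x / sqrt(2))


def gaussian_pdf(x):
    """Return the N(0, 1) density at x."""
    return exp(-x * x / 2) / sqrt(2 * pi)


rows = []
for x in (0.5, 1.0, 2.0, 3.0):
    lower = x / (1 + x * x) * gaussian_pdf(x)
    upper = gaussian_pdf(x) / x
    rows.append((x, lower, q_function(x), upper))
    print(f"x = {x}: {lower:.5f} <= Q(x) = {q_function(x):.5f} <= {upper:.5f}")
```

The two bounds become tight as x grows, since their ratio is x²/(1 + x²). 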


4.7 Adaptive Multiple Access 


In Sect. 3.4, we explained a randomized multiple access scheme. In this scheme, 
there are N active stations and each station attempts to transmit with probability 
1/N in each time slot. This scheme results in a success rate of about 1/e ≈ 36%. 
However, it requires that each station knows how many other stations are active. 

To make the scheme adaptive to the number of active devices, say that the devices 
adjust the probability p(n) with which they transmit at time n as follows: 

\[
p(n+1) =
\begin{cases}
p(n), & \text{if } X(n) = 1; \\
a\,p(n), & \text{if } X(n) > 1; \\
\min\{b\,p(n), 1\}, & \text{if } X(n) = 0.
\end{cases}
\]

Fig. 4.3 Bruce Hajek 

Fig. 4.4 Throughput of the adaptive multiple access scheme 


In these update rules, a and b are constants with a ∈ (0, 1) and b > 1. The idea 
is to increase p(n) if no device transmitted and to decrease it after a collision. This 
scheme is due to Hajek and Van Loon (1982) (Fig. 4.3). 

Figure 4.4 shows the evolution over time of the success rate T_n. Here, 

\[ T_n = \frac{1}{n}\sum_{m=0}^{n-1} 1\{X(m) = 1\}. \]

The figure uses a = 0.8 and b = 1.2. We see that the throughput approaches the 
optimal value for N = 40 and for N = 100. Thus, the scheme adapts automatically 
to the number of active devices. 
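
The update rule is short enough to simulate in a few lines. The sketch below assumes, as in Fig. 4.4, that all N devices always have a packet to send and that they all observe the channel, so that they share the same p(n). The initial value of p, the seed, the number of steps, and the function name are arbitrary choices of ours. 

```python
import random


def simulate_adaptive_access(n_devices, steps, a=0.8, b=1.2, p_init=0.5, seed=0):
    """Return the fraction of time slots with exactly one transmission."""
    rng = random.Random(seed)
    p = p_init
    successes = 0
    for _ in range(steps):
        # Each device transmits independently with the current probability p.
        transmissions = sum(rng.random() < p for _ in range(n_devices))
        if transmissions == 1:
            successes += 1           # success: p is left unchanged
        elif transmissions > 1:
            p = a * p                # collision: back off
        else:
            p = min(b * p, 1.0)      # idle slot: be more aggressive
    return successes / steps


rate_40 = simulate_adaptive_access(40, 20_000)
rate_100 = simulate_adaptive_access(100, 20_000)
print("success rate with 40 devices:", rate_40)
print("success rate with 100 devices:", rate_100)
```

Both runs use the same constants a and b, which is the point of the scheme: no device needs to know N. 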
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4.8 Summary 


• Characteristic Function; 
• Proof of CLT; 
• Moments of Gaussian; 
• Sum of Squares of Gaussians; 
• Poisson as limit of Binomial; 
• Exponential as limit of Geometric; 
• Adaptive Multiple Access Protocol. 


4.8.1 Key Equations and Formulas 


Characteristic Function      φ_X(u) = E(exp{iuX})                D.4.1 
For N(0, 1)                  φ_X(u) = exp{−u²/2}                 T.4.1 
Moments of N(0, 1)           E(X^{2m}) = (2m)!/(m! 2^m)          (4.2) 
Error Function               P(N(0, 1) > x), Bounds              T.4.2 


4.9 References 


The CLT is a classical result, see Bertsekas and Tsitsiklis (2008), Grimmett and 
Stirzaker (2001) or Billingsley (2012). 


4.10 Problems 


Problem 4.1 Let X be a N(0, 1) random variable. You will recall that E(X²) = 1 
and E(X⁴) = 3. 


(a) Use Chebyshev’s inequality to get a bound on P(|X| > 2); 

(b) Use the inequality that involves the fourth moment of X to bound P(|X| > 2). 
Do you get a better bound? 

(c) Compare with what you know about the N (0, 1) random variable. 


Problem 4.2 Write a Python simulation of Hajek's random multiple access 
scheme. There are 20 stations. An arrival occurs at each station with probability 
4/20 at each time slot. The stations update their transmission probability as 
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explained in the text. Plot the total backlog in all the stations as a function of 
time. 


Problem 4.3 Consider a multiple access scheme where the N stations indepen- 
dently transmit short reservation packets with duration equal to one time unit with 
probability p. If the reservation packets collide or no station transmits a reservation 
packet, the stations try again. Once a reservation is successful, the succeeding station 
transmits a packet during K time units. After that transmission, the process repeats. 
Calculate the maximum fraction of time that the channel can be used for transmitting 
packets. Note: This scheme is called Reservation Aloha. 


Problem 4.4 Let X be a random variable with mean zero and variance 1. Show that 
E(X⁴) ≥ 1. 


Hint Use the fact that E((X² − 1)²) ≥ 0. 

Problem 4.5 Let X, Y be two random variables. Show that 

\[ (E(XY))^2 \le E(X^2)E(Y^2). \]

This is the Cauchy-Schwarz inequality. 


Hint Use E((λX − Y)²) ≥ 0 with λ = E(XY)/E(X²). 
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Application: Social Networks, Communication Networks 
Topics: Random Graphs, Queueing Networks 


5.1 Spreading Rumors 


Picture yourself in a social network. You are connected to a number of “friends” 
who are also connected to friends. You send a message to some of your friends and 
they in turn forward it to some of their friends. We are interested in the number of 
people who eventually get the message. 

To explore this question, we model the social network as a random tree of which 
you are the root. You send a message to a random number of your friends that 
we model as the children of the root node. Similarly, every node in the graph 
has a random number of children. Assume that the numbers of children of the 
different nodes are independent, identically distributed, and have mean $\mu$. The tree
is potentially infinite, a clear mathematical idealization. The model ignores cycles 
in friendships, another simplification. 

The model is illustrated in Fig. 5.1. Thus, the graph only models the people who 
get a copy of the message. For the problem to be non-trivial, we assume that there is 
a positive probability that some nodes have no children, i.e., that someone does not 
forward the message. Without this assumption, the message always spreads forever. 

We have the following result.

Fig. 5.1 The spreading of a
message as a random tree 


Theorem 5.1 (Spreading of a Message) Let Z be the number of nodes that 
eventually receive the message. 


(a) If $\mu < 1$, then $P(Z < \infty) = 1$ and $E(Z) < \infty$;
(b) If $\mu > 1$, then $P(Z = \infty) > 0$.


We prove that result in the next chapter. The result should be intuitive: if $\mu < 1$,
the spreading dies out, like a population that does not reproduce enough. This model 
is also relevant for the spread of epidemics or cyber viruses. 
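The threshold at $\mu = 1$ is easy to observe by simulation. The following is a minimal sketch under assumptions that are not in the text: the number of children is taken to be Poisson with mean $\mu$ (so that a node has no children with positive probability), the root is counted in $Z$, and a tree with more than `cap` nodes is treated as one that never dies out. The function names are illustrative.

```python
import numpy as np

def tree_size(mu, rng, cap=10_000):
    """Total number of nodes of one random tree, or None if it exceeds cap."""
    total = 1                  # the root
    generation = 1             # number of nodes in the current generation
    while generation > 0:
        # Each node of the current generation has Poisson(mu) children.
        generation = int(rng.poisson(mu, size=generation).sum())
        total += generation
        if total > cap:
            return None        # treated as "the message spreads forever"
    return total

def summarize(mu, runs=2000, seed=0):
    """Return (fraction of runs that die out, mean size of those runs)."""
    rng = np.random.default_rng(seed)
    sizes = [tree_size(mu, rng) for _ in range(runs)]
    finite = [z for z in sizes if z is not None]
    return len(finite) / runs, float(np.mean(finite))

for mu in (0.5, 0.9, 1.5):
    died_out, mean_size = summarize(mu)
    print(f"mu={mu}: fraction that die out={died_out:.3f}, "
          f"mean size when finite={mean_size:.2f}")
```

For $\mu < 1$ every run should die out, with a mean size close to $1/(1 - \mu)$, the sum of the expected generation sizes $\mu^n$. For $\mu = 1.5$ a sizeable fraction of the runs should hit the cap.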


5.2 Cascades 


If most of your friends prefer Apple over Samsung, you may follow the majority. In 
turn, your advice will influence other friends. How big is such an influence cascade? 

We model that situation with nodes arranged in a line, in the chronological order 
of their decisions, as shown in Fig. 5.2. Node n listens to the advice of a subset of 
$\{0, 1, \ldots, n - 1\}$ who have decided before him. Specifically, node $n$ listens to the
advice of node $n - k$ independently with probability $p_k$, for $k = 1, \ldots, n$. If the
majority of these friends are blue, node n turns blue; if the majority are red, node 
n turns red; in case of a tie, node n flips a fair coin and turns red with probability 
1/2 or blue otherwise. Assume that, initially, node 0 is red. Does the fraction of red 
nodes become larger than 0.5, or does the initial effect of node 0 vanish? 

A first observation is that if nodes listen only to their left-neighbor with 
probability $p \in (0, 1)$, the cascade ends. Indeed, there is a first node that does
not listen to its neighbor and then turns red or blue with equal probabilities.
Consequently, there will be a string of red nodes followed by a string of blue nodes,
and so on. By symmetry, the lengths of those strings are independent and identically 
distributed. It is easy to see they have a finite mean. The SLLN then implies that the 
fraction of red nodes among the first n nodes converges to 0.5. In other words, the 
influence of the first node vanishes. 

The situation is less obvious if $p_k = p < 1$ for all $k$. Indeed, in this case, as $n$
gets large, node $n$ is more likely to listen to many previous nodes. The slightly
surprising result is that, no matter how small $p$ is, there is a positive probability that
all the nodes turn red.

Fig. 5.2 An influence cascade


Theorem 5.2 (Cascades) Assume $p_k = p \in (0, 1]$ for all $k \ge 1$. Then, all nodes
turn red with probability at least equal to $\theta$ where

$$\theta = \exp\left\{ -\frac{1 - p}{p} \right\}.$$


We prove the result in the next chapter. It turns out to be possible that every node 
listens to at least one previous node. In that case, all the nodes turn red. 
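A quick Monte Carlo experiment can make the cascade bound plausible. The sketch below is an illustration only: it simulates the first `n_nodes` nodes of the line (a truncation, so it estimates the probability that the first `n_nodes` nodes are all red, which can only overestimate the probability that all nodes are red), encodes red as $+1$ and blue as $-1$, and reads the bound of Theorem 5.2 as $\exp\{-(1 - p)/p\}$. The function name and parameter values are assumptions.

```python
import numpy as np

def all_red_fraction(p, n_nodes=100, n_runs=1000, seed=0):
    """Fraction of runs in which nodes 0, ..., n_nodes - 1 all turn red."""
    rng = np.random.default_rng(seed)
    all_red = 0
    for _ in range(n_runs):
        colors = np.empty(n_nodes, dtype=int)
        colors[0] = 1                         # node 0 is red (+1); blue is -1
        ok = True
        for n in range(1, n_nodes):
            listens = rng.random(n) < p       # which earlier nodes node n listens to
            vote = colors[:n][listens].sum()  # red advisers minus blue advisers
            if vote == 0:                     # tie, including listening to nobody
                vote = 1 if rng.random() < 0.5 else -1
            colors[n] = 1 if vote > 0 else -1
            if colors[n] < 0:                 # a blue node appeared: stop this run
                ok = False
                break
        all_red += ok
    return all_red / n_runs

for p in (0.2, 0.5):
    print(p, all_red_fraction(p), np.exp(-(1 - p) / p))
```

In each printed line, the simulated fraction should be at least as large as the bound in the last column.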


5.3 Seeding the Market 


Some companies distribute free products to spread their popularity. What is the best 
fraction of customers who should get free products? To explore this question, let 
us go back to our model where each node listens only to its left-neighbor with 
probability p. The system is the same as before, except that each node gets a free 
product and turns red with probability $\lambda$. The fraction of red nodes increases in $\lambda$
and we write it as $\psi(\lambda)$. If the cost of a product is $c$ and the selling price is $s$, the
company makes a profit $(s - c)\psi(\lambda) - c\lambda$ since it makes a profit $s - c$ from a buyer
and loses $c$ for each free product. The company then can select $\lambda$ to optimize its
profit. Next, we calculate $\psi(\lambda)$.

Let $\pi(n - 1)$ be the probability that user $n - 1$ is red. If user $n$ listens to $n - 1$,
he turns red unless $n - 1$ is blue and he does not get a free product. If he does not
listen to $n - 1$, he turns red with probability 0.5 if he does not get a free product and
with probability one otherwise. Thus,

$$\begin{aligned}
\pi(n) &= p\,\bigl(1 - (1 - \pi(n - 1))(1 - \lambda)\bigr) + (1 - p)\bigl(0.5(1 - \lambda) + \lambda\bigr) \\
&= p(1 - \lambda)\pi(n - 1) + 0.5 + 0.5\lambda - 0.5p + 0.5\lambda p.
\end{aligned}$$

Since $p(1 - \lambda) < 1$, the value of $\pi(n)$ converges to the value $\psi(\lambda)$ that solves the
fixed point equation

$$\psi(\lambda) = p(1 - \lambda)\psi(\lambda) + 0.5 + 0.5\lambda - 0.5p + 0.5\lambda p.$$

Hence,

$$\psi(\lambda) = 0.5\,\frac{1 + \lambda - p + \lambda p}{1 - p(1 - \lambda)}.$$

To maximize the profit $(s - c)\psi(\lambda) - c\lambda$, we substitute the expression for $\psi(\lambda)$
in the profit and we set the derivative with respect to $\lambda$ equal to zero. After some
algebra, we find that the optimal $\lambda^*$ is given by

$$\lambda^* = \min\left\{ 1, \ \frac{(1 - p)^{1/2}\left[0.5\,\frac{s - c}{c}\right]^{1/2} - (1 - p)}{p} \right\}.$$


Not surprisingly, $\lambda^*$ increases with the profit margin $(s - c)/c$ and decreases with
$p$.
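The calculation can be checked numerically. The sketch below iterates the recursion for the probability of being red directly from the verbal description of the model, compares the limit with the fixed point $0.5(1 + \lambda - p + \lambda p)/(1 - p(1 - \lambda))$, and locates the best $\lambda$ by a grid search over $[0, 1]$. The function names and the values of $p$, $s$, $c$ are hypothetical, and the closed form in `lambda_star` is the formula as read here, $\min\{1, (\sqrt{0.5(1 - p)(s - c)/c} - (1 - p))/p\}$.

```python
import numpy as np

def red_fraction_by_iteration(p, lam, n_steps=500):
    """Iterate the recursion for the probability that a user is red."""
    pi = 0.5                                   # arbitrary starting value
    for _ in range(n_steps):
        listen = 1 - (1 - pi) * (1 - lam)      # red unless neighbor is blue and no free product
        alone = 0.5 * (1 - lam) + lam          # fair coin unless he gets a free product
        pi = p * listen + (1 - p) * alone
    return pi

def red_fraction_closed_form(p, lam):
    """Fixed point of the recursion."""
    return 0.5 * (1 + lam - p + lam * p) / (1 - p * (1 - lam))

def best_lambda_by_grid(p, s, c, n_grid=100_001):
    """Maximize the profit (s - c) psi(lam) - c lam over a grid on [0, 1]."""
    lams = np.linspace(0.0, 1.0, n_grid)
    profit = (s - c) * red_fraction_closed_form(p, lams) - c * lams
    return float(lams[np.argmax(profit)])

def lambda_star(p, s, c):
    """Closed form for the optimum, as read from the text (capped at 1)."""
    root = np.sqrt(0.5 * (1 - p) * (s - c) / c)
    return min(1.0, float((root - (1 - p)) / p))

p, s, c = 0.5, 3.0, 1.0        # hypothetical values chosen for illustration
print(red_fraction_by_iteration(p, 0.3), red_fraction_closed_form(p, 0.3))
print(best_lambda_by_grid(p, s, c), lambda_star(p, s, c))
```

The two numbers on each printed line should agree closely. When the margin is so small that the closed-form expression is negative, the grid search returns 0, since the profit is then decreasing in $\lambda$.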


5.4 Manufacturing of Consent 


Three people walk into a bar. No, this is not a joke. They chat and, eventually, leave 
with the same majority opinion. As such events repeat, the opinion of the population 
evolves. We explore a model of this evolution. 

Consider a population of $2N \ge 4$ people. Initially, half believe red and the other
half believe blue. We choose three people at random. If two are blue and one is red,
they all become blue, and they return to the general population. The other cases are
similar. The same process then repeats. Let $X_n$ be the number of blue people after $n$
steps, for $n \ge 1$, and let $X_0 = N$. Then $X_n$ is a Markov chain. This Markov chain has
two absorbing states: 0 and $2N$. Indeed, if $X_n = k$ for some $k \in \{1, \ldots, 2N - 1\}$,
there is a positive probability of choosing three people where two have one opinion
and the third has a different one. After their meeting, $X_{n+1} \ne X_n$. From state 1,
the chain can only stay put or jump to 0, and from state $2N - 1$ it can only stay put
or jump to $2N$, with $P(1, 0) > 0$ and $P(2N - 1, 2N) > 0$. Moreover, $P(k, k) >
0$, $P(k, k + 1) > 0$, and $P(k, k - 1) > 0$ for all $k \in \{2, \ldots, 2N - 2\}$. Consequently,
with probability one,

$$\lim_{n \to \infty} X_n \in \{0, 2N\}.$$


Thus, eventually, everyone is blue or everyone is red. By symmetry, the two limits 
have probability 0.5. 

What is the effect of the media on the limiting consensus? Let us modify our 
previous model by assuming that when two blue and one red person meet, they all 
turn blue with probability 1 — p and remain as before with probability p. Here p 
models the power of the media at convincing people to stay red. If two red and one 
blue meet, they all turn red. 

We have, for $k \in \{2, \ldots, 2N - 2\}$,

$$P[X_{n+1} = k + 1 \mid X_n = k] = (1 - p)\,\frac{3k(k - 1)(2N - k)}{2N(2N - 1)(2N - 2)} =: p(k).$$

Indeed, $X_n$ increases with probability $1 - p$ from $k$ to $k + 1$ if in the meeting
two people are blue and one is red. The probability that the first one is blue is
$k/(2N)$ since there are $k$ blue people among $2N$. The probability that the second
is also blue is then $(k - 1)/(2N - 1)$. Also, the probability that the third is red is
$(2N - k)/(2N - 2)$ since there are $2N - k$ red people among the $2N - 2$ who
remain after picking two blue. Finally, there are three orderings in which one could
pick one red and two blue.

Similarly, for $k \in \{2, \ldots, 2N - 2\}$,

$$P[X_{n+1} = k - 1 \mid X_n = k] = \frac{3(2N - k)(2N - k - 1)k}{2N(2N - 1)(2N - 2)} =: q(k).$$


We want to calculate

$$\alpha(k) = P[T_{2N} < T_0 \mid X_0 = k],$$

where $T_0$ is the first time that $X_n = 0$ and $T_{2N}$ is the first time that $X_n = 2N$. Then,
$\alpha(N)$ is the probability that the population eventually becomes all blue.

The first step equations are, for $k \in \{2, \ldots, 2N - 2\}$,

$$\alpha(k) = p(k)\alpha(k + 1) + q(k)\alpha(k - 1) + (1 - p(k) - q(k))\alpha(k),$$

i.e.,

$$(p(k) + q(k))\alpha(k) = p(k)\alpha(k + 1) + q(k)\alpha(k - 1).$$

The boundary conditions are $\alpha(1) = 0$, $\alpha(2N - 1) = 1$.

We solve these equations numerically, using Python. Our procedure is as follows.
We let $\alpha(1) = 0$ and $\alpha(2) = A$, for some constant $A$. We then solve recursively

$$\begin{aligned}
\alpha(k + 1) &= \left(1 + \frac{q(k)}{p(k)}\right)\alpha(k) - \frac{q(k)}{p(k)}\,\alpha(k - 1) \\
&= \left(1 + \frac{2N - k - 1}{(1 - p)(k - 1)}\right)\alpha(k) - \frac{2N - k - 1}{(1 - p)(k - 1)}\,\alpha(k - 1), \quad k = 2, 3, \ldots, 2N - 2.
\end{aligned}$$

Eventually, we find $\alpha(2N - 1)$. This value is proportional to $A$. Since we need
$\alpha(2N - 1) = 1$, we then divide all the $\alpha(k)$ by the computed value of $\alpha(2N - 1)$.
Not elegant, but it works. We repeat this process for $p = 0, 0.02, 0.04, \ldots, 0.14$.
Figure 5.3 shows the results for $N = 450$, i.e., for a population of 900 people.

Fig. 5.3 The effect of the media. Here, $p$ is the probability that someone remains red after chatting
with two blue people. The graph shows the probability that the whole population turns blue instead
of red. A small amount of persuasion goes a long way
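The procedure described above can be sketched as follows. This is not the code used for the figure. It relies on one rewriting that is an assumption of this sketch: the recursion is equivalent to $d(k) = \frac{q(k)}{p(k)}\, d(k - 1)$ for the increments $d(k) = \alpha(k + 1) - \alpha(k)$, and the increments are computed through their logarithms because they become very large for $N = 450$. The function name is illustrative.

```python
import numpy as np

def blue_consensus_probability(N, p):
    """alpha[k] = P(all 2N people end up blue | X_0 = k), for k = 0, ..., 2N."""
    M = 2 * N
    k = np.arange(2, M - 1)                        # k = 2, ..., 2N - 2
    # Ratio q(k)/p(k) = (2N - k - 1) / ((1 - p)(k - 1)).
    log_ratio = np.log(M - k - 1) - np.log((1 - p) * (k - 1))
    # Increments d(1), ..., d(2N - 2), up to a common factor (the constant A).
    log_d = np.concatenate(([0.0], np.cumsum(log_ratio)))
    d = np.exp(log_d - log_d.max())                # rescale to avoid overflow
    partial = np.cumsum(d)
    alpha = np.zeros(M + 1)                        # alpha(0) = alpha(1) = 0
    alpha[2:M] = partial / partial[-1]             # normalize so alpha(2N - 1) = 1
    alpha[M] = 1.0
    return alpha

N = 450
for j in range(8):
    p = 0.02 * j                                   # p = 0, 0.02, ..., 0.14
    print(f"p={p:.2f}: alpha(N)={blue_consensus_probability(N, p)[N]:.4f}")
```

Dividing by the last partial sum plays the role of dividing by the computed $\alpha(2N - 1)$ in the text.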


5.5 Polarization 


In most countries, the population is split among different political and religious 
persuasions. How is this possible if everyone is faced with the same evidence? One 
effect is that interactions are not fully mixing. People belong to groups that may 
converge to a consensus based on the majority opinion of the group. 

To model this effect, we consider a population of N people. An adjacency matrix 
G specifies which people are friends. Here, G(v, w) = 1 if v and w are friends and 
$G(v, w) = 0$ otherwise.

Initially, people are blue or red with equal probabilities. We pick one person at 
random. If that person has a majority of red friends, she becomes red. If the majority 
of her friends are blue, she becomes blue. If it is a tie, she does not change. We repeat
the process. Note that the graph does not change; it is fixed throughout. We want to 
explore how the coloring of people evolves over time. 

Let $X_n(v) \in \{B, R\}$ be the state of person $v$ at time $n$, for $n \ge 0$ and $v \in
\{1, \ldots, N\}$. We pick $v$ at random. We count the number of red friends and blue
friends of $v$. They are given by

$$\sum_w G(v, w)\,1\{X_n(w) = R\} \quad \text{and} \quad \sum_w G(v, w)\,1\{X_n(w) = B\}.$$

Thus,

$$X_{n+1}(v) = \begin{cases}
R, & \text{if } \sum_w G(v, w)\,1\{X_n(w) = R\} > \sum_w G(v, w)\,1\{X_n(w) = B\} \\
B, & \text{if } \sum_w G(v, w)\,1\{X_n(w) = R\} < \sum_w G(v, w)\,1\{X_n(w) = B\} \\
X_n(v), & \text{otherwise.}
\end{cases}$$


We have the following result.

Theorem 5.3 The state $X_n = \{X_n(v), v = 1, \ldots, N\}$ of the system always
converges. However, the limit may be random.


Proof Define the function $V(X_n)$ as follows:

$$V(X_n) = \sum_v \sum_w G(v, w)\,1\{X_n(v) \ne X_n(w)\}.$$

That is, $V(X_n)$ counts the disagreements among friends (each pair is counted twice).
The rules of evolution guarantee that $V(X_{n+1}) \le V(X_n)$ and that $P(V(X_{n+1}) < V(X_n)) > 0$
unless $P(X_{n+1} = X_n) = 1$. Indeed, if the state of $v$ changes, it is to make that
person agree with more of her neighbors. Also, if there is no v who can reduce 
her number of disagreements, then the state can no longer change. These properties 
imply that the state converges. 

A simple example shows that the limit may be random. Consider four people at 
the vertices of a square that represents G. Assume that two opposite vertices are 
blue and the other two are red. If the first person v to reconsider her opinion is blue, 
she turns red, and the limit is all red. If v is red, the limit is all blue. Thus, the limit 
is equally likely to be all red or all blue. $\square$


In the limit, it may be that a fraction of the nodes are red and the others are blue. 
For instance, if the nodes are arranged in a line graph, then the limit is alternating 
sequences of at least two red nodes and sequences of at least two blue nodes. 
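The argument of the proof can be watched at work on the square example. The sketch below is an illustration with hypothetical function names: it encodes red as 1 and blue as 0, updates random people until nobody wants to change, and records the number of disagreements after each change (counting each pair of friends twice, once in each order).

```python
import numpy as np

RED, BLUE = 1, 0

def disagreements(G, x):
    """Number of ordered pairs of friends (v, w) with different colors."""
    return int((G * (x[:, None] != x[None, :])).sum())

def wants_to_change(G, x, v):
    """Return the color v would adopt, or None if v keeps her color."""
    red = int(G[v] @ (x == RED))
    blue = int(G[v] @ (x == BLUE))
    if red == blue:
        return None                      # tie: no change
    new = RED if red > blue else BLUE
    return None if new == x[v] else new

def run(G, x, rng, max_steps=10_000):
    """Update random people until the state is stable; return state and history."""
    x = x.copy()
    history = [disagreements(G, x)]
    for _ in range(max_steps):
        if all(wants_to_change(G, x, v) is None for v in range(len(x))):
            break
        v = rng.integers(len(x))
        new = wants_to_change(G, x, v)
        if new is not None:
            x[v] = new
            history.append(disagreements(G, x))
    return x, history

# Four people on a square; opposite corners share a color.
G = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
x0 = np.array([RED, BLUE, RED, BLUE])
final, history = run(G, x0, np.random.default_rng(0))
print(final, history)
```

The recorded values never increase, and on the square the run ends with all four people sharing one color. Which color it is depends on the random seed.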

The properties of the limit depend on the adjacency graph G. One might think 
that a close group of friends should have the same color, but that is not necessarily 
the case, as the example of Fig. 5.4 shows. 


Fig. 5.4 A close group of 
friends, the four vertices of 
the square, do not share the 
same color

5.6 M/M/1 Queue


We discuss a simple model of a queue, called an M/M/1 queue. We turn to networks
in the next section. This section uses concepts from continuous-time Markov chains 
that we develop in the next chapter. Thus, the discussion here is a bit informal, but 
is hopefully clear enough to be read first. 

Figure 5.5 illustrates a queue where customers (this is the standard terminology) 
arrive and a server serves them one at a time, in a first come, first served order. 
The times between arrivals are independent and exponentially distributed with
rate $\lambda$. Thus, the average time between two consecutive arrivals is $1/\lambda$, so that $\lambda$
customers arrive per unit of time, on average. The service times are independent
and exponentially distributed with rate $\mu$. The durations of the service times and the
arrival times are independent. The expected value of a service time is $1/\mu$. Thus, if
the queue were never empty, there would be $\mu$ service completions per unit time, on
average. If $\lambda < \mu$, the server can keep up with the arrivals, and the queue should
empty regularly. If $\lambda > \mu$, one can expect the number of customers in the queue to
increase without bound. 

In the notation M/M/1, the first M indicates that the inter-arrival times are 
memoryless, the second M indicates that the service times are memoryless, and the 
1 indicates that there is one server. As you may expect, there are related notations 
such as D/M/3 or M/G/5, and so on, where the inter-arrival times and the service 
times have other properties and there are multiple servers. 

Let $X_t$ be the number of customers in the queue at time $t$, for $t \ge 0$. We call $X_t$
the queue length process. The middle part of Fig. 5.5 shows a possible realization
of that process. Observing the queue length process up to some time $t$ provides
information about previous inter-arrival times and service times and also about
when the last arrival occurred and when the last service started. Since the
inter-arrival times and service times are independent and memoryless, this information
is independent of the time until the next arrival or the next service completion. In
particular, given $\{X_s, s \le t\}$, the likelihood that a new arrival occurs during $(t, t + \epsilon]$
is approximately $\lambda\epsilon$ for $\epsilon \ll 1$; also the likelihood that a service completes during
$(t, t + \epsilon]$ is approximately $\mu\epsilon$ if $X_t > 0$ and zero if $X_t = 0$.


Fig. 5.5 An M/M/1 queue, a possible realization, and its state transition diagram


The bottom part of Fig. 5.5 is a state transition diagram that indicates the rates of
transitions. For instance, the arrow from 1 to 2 is marked with $\lambda$ to indicate that, in
$\epsilon \ll 1$ s, the queue length jumps from 1 to 2 with probability $\lambda\epsilon$. The figure shows
that arrivals (that increase the queue length) occur at the same rate $\lambda$, independently
of the queue length. Also, service completions (that reduce the queue length) occur
at rate $\mu$ as long as the queue is nonempty.

Note that

$$\begin{aligned}
P(X_{t+\epsilon} = 0) &= P(X_t = 0, X_{t+\epsilon} = 0) + P(X_t = 1, X_{t+\epsilon} = 0) \\
&\approx P(X_t = 0)(1 - \lambda\epsilon) + P(X_t = 1)\mu\epsilon.
\end{aligned}$$

The first identity is the law of total probability: the event $\{X_{t+\epsilon} = 0\}$ is the union of
the two disjoint events $\{X_t = 0, X_{t+\epsilon} = 0\}$ and $\{X_t = 1, X_{t+\epsilon} = 0\}$. The second
identity uses the fact that $\{X_t = 0, X_{t+\epsilon} = 0\}$ occurs when $X_t = 0$ and there is no
arrival during $(t, t + \epsilon]$. This event has probability $P(X_t = 0)$ multiplied by $(1 - \lambda\epsilon)$
since arrivals are independent of the current queue length. The other term is similar.

Now, imagine that $\pi$ is a pmf on $\mathbb{Z}_{\ge 0} := \{0, 1, \ldots\}$ such that $P(X_t = i) = \pi(i)$
for all time $t$ and $i \in \mathbb{Z}_{\ge 0}$. That is, assume that $\pi$ is an invariant distribution for $X_t$.
In that case, $P(X_{t+\epsilon} = 0) = \pi(0)$, $P(X_t = 0) = \pi(0)$, and $P(X_t = 1) = \pi(1)$.
Hence, the previous identity implies that

$$\pi(0) \approx \pi(0)(1 - \lambda\epsilon) + \pi(1)\mu\epsilon.$$

Subtracting $\pi(0)(1 - \lambda\epsilon)$ from both sides gives

$$\pi(0)\lambda\epsilon \approx \pi(1)\mu\epsilon.$$

Dividing by $\epsilon$ shows that¹

$$\pi(0)\lambda = \pi(1)\mu. \tag{5.1}$$

Similarly, for $i \ge 1$, one has

$$\begin{aligned}
P(X_{t+\epsilon} = i) &= P(X_t = i - 1, X_{t+\epsilon} = i) + P(X_t = i, X_{t+\epsilon} = i) + P(X_t = i + 1, X_{t+\epsilon} = i) \\
&\approx P(X_t = i - 1)\lambda\epsilon + P(X_t = i)(1 - \lambda\epsilon - \mu\epsilon) + P(X_t = i + 1)\mu\epsilon.
\end{aligned}$$

Hence,

$$\pi(i) \approx \pi(i - 1)\lambda\epsilon + \pi(i)(1 - \lambda\epsilon - \mu\epsilon) + \pi(i + 1)\mu\epsilon.$$

This relation implies that

$$\pi(i)(\lambda + \mu) = \pi(i - 1)\lambda + \pi(i + 1)\mu, \quad i \ge 1. \tag{5.2}$$

¹ Technically, the $\approx$ sign is an identity up to a term negligible in $\epsilon$. When we divide by $\epsilon$ and let
$\epsilon \to 0$, the $\approx$ becomes an identity.


The Eqs. (5.1)-(5.2) are called the balance equations. Thus, if $\pi$ is invariant for $X_t$,
it must satisfy the balance equations. Looking back at our calculations, we also see
that if $\pi$ satisfies the balance equations, and if $P(X_t = i) = \pi(i)$ for all $i$, then
$P(X_{t+\epsilon} = i) = \pi(i)$ for all $i$. Thus, $\pi$ is invariant for $X_t$ if and only if it satisfies
the balance equations.

One can solve the balance equations (5.1)-(5.2) as follows. Equation (5.1) shows
that $\pi(1) = \rho\pi(0)$ with $\rho = \lambda/\mu$. Subtracting (5.1) from (5.2) with $i = 1$ yields

$$\pi(1)\lambda = \pi(2)\mu.$$

This equation then shows that $\pi(2) = \pi(1)\rho = \pi(0)\rho^2$. Continuing in this way
shows that $\pi(n) = \pi(0)\rho^n$ for $n \ge 0$. To find $\pi(0)$, we use the fact that $\sum_n \pi(n) = 1$.
That is

$$\sum_{n=0}^{\infty} \pi(0)\rho^n = 1.$$

If $\rho \ge 1$, i.e., if $\lambda \ge \mu$, this is not possible. In that case there is no invariant
distribution. If $\rho < 1$, then the previous equation becomes

$$\pi(0)\,\frac{1}{1 - \rho} = 1,$$

so that $\pi(0) = 1 - \rho$ and

$$\pi(n) = (1 - \rho)\rho^n, \quad n \ge 0.$$

In particular, when $X_t$ has the invariant distribution $\pi$, one has

$$E(X_t) = \sum_{n=0}^{\infty} n(1 - \rho)\rho^n = \frac{\rho}{1 - \rho} = \frac{\lambda}{\mu - \lambda} =: L.$$
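The geometric form of the invariant distribution can be compared with a simulation of the queue length process. The following sketch is illustrative: it follows the state transition diagram (from state $x$, the next transition occurs after an exponential time with rate $\lambda + \mu 1\{x > 0\}$ and is an arrival with probability $\lambda$ divided by that rate), and it uses the standard fact that long-run time fractions approach the invariant distribution when $\rho < 1$. The function name and the values of $\lambda$ and $\mu$ are assumptions.

```python
import random

def simulate_mm1(lam, mu, n_events=200_000, seed=0, n_states=6):
    """Fraction of time the queue length equals 0, ..., n_states - 1."""
    rng = random.Random(seed)
    x, clock = 0, 0.0
    time_in_state = [0.0] * n_states
    for _ in range(n_events):
        rate = lam + (mu if x > 0 else 0.0)    # total rate out of state x
        dt = rng.expovariate(rate)             # time until the next transition
        if x < n_states:
            time_in_state[x] += dt
        clock += dt
        # The transition is an arrival with probability lam / rate.
        if rng.random() < lam / rate:
            x += 1
        else:
            x -= 1
    return [t / clock for t in time_in_state]

lam, mu = 1.0, 2.0            # illustrative values, rho = 0.5
rho = lam / mu
for n, frac in enumerate(simulate_mm1(lam, mu)):
    print(n, round(frac, 4), round((1 - rho) * rho**n, 4))
```

The second and third columns should be close to each other for every queue length $n$.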


To calculate the average delay $W$ of a customer in the queue, one can use Little's
Law $L = \lambda W$. This identity implies that

$$W = \frac{1}{\mu - \lambda}.$$

Another way of deriving this expression is to realize that if a customer finds
$k$ other customers in the queue upon his arrival, he has to wait for $k + 1$ service
completions before he leaves. Since every service completion lasts $1/\mu$ on average,
his average delay is $(k + 1)/\mu$. Now, the probability that this customer finds $k$ other
customers in the queue is $\pi(k)$. To see this, note that the probability that a customer
who enters the queue between time $t$ and $t + \epsilon$ finds $k$ customers in the queue is

$$P[X_t = k \mid X_{t+\epsilon} = X_t + 1].$$

Now, the conditioning event is independent of $X_t$, because the arrivals occur at
rate $\lambda$, independently of the queue length. Thus, the expression above is equal to
$P(X_t = k) = \pi(k)$. Hence,

$$W = \sum_{k=0}^{\infty} \frac{k + 1}{\mu}\,\pi(k) = \frac{1}{\mu - \lambda},$$

as some simple algebra shows.
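The value of $W$ can also be estimated by following customers one at a time. The sketch below does not use the argument of the text. It uses the Lindley recursion for a first come, first served single server: the delay of a customer equals the delay of the previous customer minus the time between their arrivals (or zero if that difference is negative), plus his own service time. The function name and parameter values are illustrative.

```python
import random

def mean_delay(lam, mu, n_customers=200_000, seed=0):
    """Average time in the system (waiting plus service) over many customers."""
    rng = random.Random(seed)
    delay = 0.0          # delay of the previous customer (none yet)
    total = 0.0
    for _ in range(n_customers):
        gap = rng.expovariate(lam)        # time since the previous arrival
        service = rng.expovariate(mu)     # service time of this customer
        # Work still in front of this customer when he arrives, plus his service.
        delay = max(delay - gap, 0.0) + service
        total += delay
    return total / n_customers

lam, mu = 1.0, 2.0
print(mean_delay(lam, mu), 1 / (mu - lam))
```

The two printed numbers should be close when $\lambda < \mu$.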


5.7 Network of Queues 


Figure 5.6 shows a representative network of queues. Two types of customers arrive
into the network, with respective rates $\gamma_1$ and $\gamma_2$. The first type goes through queue
1, then queue 3, and should leave the network. However, with probability $p_1$ these
customers must go back to queue 1 and try again. In a communication network, this
event models a transmission error where a packet (a group of bits) gets corrupted
and has to be retransmitted. The situation is similar for the other type. Thus, in
$\epsilon \ll 1$ time unit, a packet of the first type arrives with probability $\gamma_1\epsilon$, independently
of what happened previously. This is similar to the arrivals into an M/M/1 queue.
Also, we assume that the service times are exponentially distributed with rate $\mu_k$ in
queue $k$, for $k = 1, 2, 3$.

Let $X_t^k$ be the number of customers in queue $k$ at time $t$, for $k = 1, 2$ and $t \ge 0$.
Let also $X_t^3$ be the list of customer types in queue 3 at time $t$. For instance, in
Fig. 5.6, one has $X_t^3 = (1, 1, 2, 1)$, from tail to head of the queue to indicate that the
customer at the head of the queue is of type 1, that he is followed by a customer of
type 2, etc. Because of the memoryless property of the exponential distribution, the
process $X_t = (X_t^1, X_t^2, X_t^3)$ is a Markov chain: observing the past up to time $t$ does
not help predict the time of the next arrival or service completion.

Fig. 5.6 A network of queues

Figure 5.6 shows the transition rates out of the current state (3, 2, (1, 1, 2, 1)). 
For instance, with rate $\mu_3 p_1$, a service completes in queue 3 and that customer has
to go back to queue 1, so that the new state is (4, 2, (1, 1, 2)). The other transitions 
are similar. 

One can then, in principle, write down the balance equations and try to solve 
them. This looks like a very complex task and it seems very unlikely that one 
could solve these equations analytically. However, a miracle occurs and one has the 
remarkably simple result stated in the next theorem. Before we state the result, we 
need to define $\lambda_1$, $\lambda_2$, and $\lambda_3$. As sketched in Fig. 5.6, for $k = 1, 2, 3$, the quantity
$\lambda_k$ is the rate at which customers go through queue $k$, in the long term. These rates
should be such that

$$\begin{aligned}
\lambda_1 &= \gamma_1 + \lambda_1 p_1 \\
\lambda_2 &= \gamma_2 + \lambda_2 p_2 \\
\lambda_3 &= \lambda_1 + \lambda_2.
\end{aligned}$$

For instance, the rate $\lambda_1$ at which customers enter queue 1 is the rate $\gamma_1$ plus the rate
at which customers of type 1 that leave queue 3 are sent back to queue 1. Customers
of type 1 go through queue 3 at rate $\lambda_1$, since they come out of queue 1 at rate $\lambda_1$;
also, a fraction $p_1$ of these customers go back to queue 1. The other expressions
can be understood similarly. The equations above are called the flow conservation
equations.

These equations admit the following solution:

$$\lambda_1 = \frac{\gamma_1}{1 - p_1}, \quad \lambda_2 = \frac{\gamma_2}{1 - p_2}, \quad \lambda_3 = \lambda_1 + \lambda_2.$$


Theorem 5.4 (Invariant Distribution of Network) Assume $\lambda_k < \mu_k$ and let
$\rho_k = \lambda_k/\mu_k$, for $k = 1, 2, 3$. Then the Markov chain $X_t$ has a unique invariant
distribution $\pi$ that is given by

$$\begin{aligned}
\pi(x_1, x_2, x_3) &= \pi_1(x_1)\pi_2(x_2)\pi_3(x_3) \\
\pi_k(n) &= (1 - \rho_k)\rho_k^n, \quad n \ge 0, \ k = 1, 2 \\
\pi_3(a_1, a_2, \ldots, a_n) &= p(a_1)p(a_2) \cdots p(a_n)(1 - \rho_3)\rho_3^n, \quad n \ge 0, \ a_k \in \{1, 2\}, \ k = 1, \ldots, n,
\end{aligned}$$

where $p(1) = \lambda_1/(\lambda_1 + \lambda_2)$ and $p(2) = \lambda_2/(\lambda_1 + \lambda_2)$.

This result shows that the invariant distribution has a product form.

We prove this result in the next chapter. It indicates that under the invariant
distribution $\pi$, the states of the three queues are independent. Moreover, the state
of queue 1 has the same invariant distribution as an M/M/1 queue with arrival rate
$\lambda_1$ and service rate $\mu_1$, and similarly for queue 2. Finally, queue 3 has the same
invariant distribution as a single queue with arrival rates $\lambda_1$ and $\lambda_2$ and service rate
$\mu_3$: the length of queue 3 has the same distribution as an M/M/1 queue with arrival
rate $\lambda_1 + \lambda_2$ and the types of the customers in the queue are independent and of type
1 with probability $p(1)$ and 2 with probability $p(2)$.

This result is remarkable not only for its simplicity but mostly because it is 
surprising. The independence of the states of the queues is shocking: the arrivals into 
queue 3 are the departures from the other two queues, so it seems that if customers 
are delayed in queues 1 and 2, one should have larger values for $X_t^1$ and $X_t^2$ and a
smaller one for the length of queue 3. Thus, intuition suggests a strong dependency 
between the queue lengths. Moreover, the fact that the invariant distributions of the 
queues are the same as for M/M/1 queues is also shocking. Indeed, if there are 
many customers in queue 1, we know that a fraction of them will come back into 
the queue, so that future arrivals into queue 1 depend on the current queue length, 
which is not the case for an M/M/1 queue. The paradox is explained in a reference.
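Since the result is counterintuitive, it is worth checking by simulation. The sketch below simulates the three-queue network with retransmissions and compares the time-averaged queue lengths with $\rho_k/(1 - \rho_k)$, the mean of the geometric distribution predicted by the product form. The function names and all parameter values are hypothetical, and the check covers only the mean queue lengths, not the independence of the queues.

```python
import random
from collections import deque

def simulate_network(gamma, p, mu, n_events=200_000, seed=0):
    """Time-averaged lengths of queues 1, 2, 3."""
    rng = random.Random(seed)
    x1 = x2 = 0                     # lengths of queues 1 and 2
    q3 = deque()                    # customer types in queue 3, head on the left
    area = [0.0, 0.0, 0.0]          # time integrals of the queue lengths
    clock = 0.0
    for _ in range(n_events):
        rates = [gamma[0], gamma[1],
                 mu[0] if x1 else 0.0,
                 mu[1] if x2 else 0.0,
                 mu[2] if q3 else 0.0]
        dt = rng.expovariate(sum(rates))          # time to the next transition
        area[0] += x1 * dt
        area[1] += x2 * dt
        area[2] += len(q3) * dt
        clock += dt
        event = rng.choices(range(5), weights=rates)[0]
        if event == 0:
            x1 += 1                                # arrival of type 1
        elif event == 1:
            x2 += 1                                # arrival of type 2
        elif event == 2:
            x1 -= 1; q3.append(1)                  # queue 1 -> queue 3
        elif event == 3:
            x2 -= 1; q3.append(2)                  # queue 2 -> queue 3
        else:
            kind = q3.popleft()                    # service completion in queue 3
            if rng.random() < p[kind - 1]:         # error: back to the first queue
                if kind == 1:
                    x1 += 1
                else:
                    x2 += 1
    return [a / clock for a in area]

def product_form_means(gamma, p, mu):
    """Mean queue lengths rho_k / (1 - rho_k) predicted by the theorem."""
    a1 = gamma[0] / (1 - p[0])
    a2 = gamma[1] / (1 - p[1])
    return [a / (m - a) for a, m in zip((a1, a2, a1 + a2), mu)]

gamma, p, mu = (1.0, 1.0), (0.1, 0.2), (3.0, 3.0, 5.0)
print(simulate_network(gamma, p, mu))
print(product_form_means(gamma, p, mu))
```

The two printed lists should agree to within a few percent.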

We use this theorem to calculate the delay of customers in the network. 


Theorem 5.5 For $k = 1, 2$, the average delay $W_k$ of customers of type $k$ is given by

$$W_k = \frac{1}{1 - p_k}\left(\frac{1}{\mu_k - \lambda_k} + \frac{1}{\mu_3 - \lambda_1 - \lambda_2}\right),$$

where $\lambda_k = \gamma_k/(1 - p_k)$ for $k = 1, 2$.


Proof We use Little's Law that says that $L_k = \gamma_k W_k$ where $L_k$ is the average
number of customers of type $k$ in the network. Consider the case $k = 1$. The other
one is similar. $L_1$ is the average number of customers in queue 1 plus the average
number of customers of type 1 in queue 3.

The average length of queue 1 is $\lambda_1/(\mu_1 - \lambda_1)$ because the invariant distribution
of queue 1 is the same as that of an M/M/1 queue with arrival rate $\lambda_1$ and service
rate $\mu_1$.

The average length of queue 3 is $(\lambda_1 + \lambda_2)/(\mu_3 - \lambda_1 - \lambda_2)$ because the invariant
distribution of the length of queue 3 is the same as that of an M/M/1 queue with arrival
rate $\lambda_1 + \lambda_2$ and service rate $\mu_3$. Also, the probability that any customer in queue 3
is of type 1 is $p(1) = \lambda_1/(\lambda_1 + \lambda_2)$. Thus, the average number of customers of type 1
in queue 3 is

$$p(1)\,\frac{\lambda_1 + \lambda_2}{\mu_3 - \lambda_1 - \lambda_2} = \frac{\lambda_1}{\mu_3 - \lambda_1 - \lambda_2}.$$

Hence,

$$L_1 = \frac{\lambda_1}{\mu_1 - \lambda_1} + \frac{\lambda_1}{\mu_3 - \lambda_1 - \lambda_2}.$$

Combined with Little's Law, this expression yields $W_1$. $\square$


5.8 Optimizing Capacity 


We use our network model to optimize the rates of the transmitters. The basic idea 
is that nodes with more traffic should have faster transmitters. To make this idea
precise, we formulate an optimization problem: minimize a delay cost subject to a 
given budget for buying the transmitters. 

We carry out the calculations not because of the importance of the specific 
example (it is not important!) but because they are representative of problems of 
this type. 

Consider once again the network in Fig. 5.6. Assume that the cost of the
transmitters is $c_1\mu_1 + c_2\mu_2 + c_3\mu_3$. The delay cost is $d_1 W_1 + d_2 W_2$ where $W_k$
is the average delay for packets of type $k$ ($k = 1, 2$). The problem is then as follows:

$$\text{Minimize } D(\mu_1, \mu_2, \mu_3) := d_1 W_1 + d_2 W_2$$

$$\text{subject to } C(\mu_1, \mu_2, \mu_3) := c_1\mu_1 + c_2\mu_2 + c_3\mu_3 \le B.$$

Thus, the objective function is

$$D(\mu_1, \mu_2, \mu_3) = \sum_{k=1,2} \frac{d_k}{1 - p_k}\left(\frac{1}{\mu_k - \lambda_k} + \frac{1}{\mu_3 - \lambda_1 - \lambda_2}\right).$$


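We convert the constrained optimization problem into an unconstrained one by
replacing the constraint by a penalty. That is, we consider the problem

$$\text{Minimize } D(\mu_1, \mu_2, \mu_3) + \alpha\,(C(\mu_1, \mu_2, \mu_3) - B),$$

where $\alpha > 0$ is a Lagrange multiplier that penalizes capacities that have a high cost.
To solve this problem for a given value of $\alpha$, we set to zero the derivative of this
expression with respect to each $\mu_k$. For $k = 1, 2$ we find

$$0 = \frac{\partial}{\partial\mu_k} D(\mu_1, \mu_2, \mu_3) + \frac{\partial}{\partial\mu_k}\,\alpha C(\mu_1, \mu_2, \mu_3)
= -\frac{d_k}{1 - p_k}\,\frac{1}{(\mu_k - \lambda_k)^2} + \alpha c_k.$$

For $k = 3$, we find

$$0 = -\frac{d_1/(1 - p_1) + d_2/(1 - p_2)}{(\mu_3 - \lambda_1 - \lambda_2)^2} + \alpha c_3.$$

Hence,

$$\mu_k = \lambda_k + \left(\frac{d_k}{\alpha c_k(1 - p_k)}\right)^{1/2}, \quad \text{for } k = 1, 2$$

$$\mu_3 = \lambda_1 + \lambda_2 + \left(\frac{d_1/(1 - p_1) + d_2/(1 - p_2)}{\alpha c_3}\right)^{1/2}.$$

These identities express $\mu_1$, $\mu_2$, and $\mu_3$ in terms of $\alpha$. Using these expressions in
$C(\mu_1, \mu_2, \mu_3)$, we find that the cost is given by

$$C(\mu_1, \mu_2, \mu_3) = c_1\lambda_1 + c_2\lambda_2 + c_3(\lambda_1 + \lambda_2)
+ \frac{1}{\sqrt{\alpha}}\left[\sum_{k=1,2}\left(\frac{d_k c_k}{1 - p_k}\right)^{1/2} + \left(c_3\sum_{k=1,2}\frac{d_k}{1 - p_k}\right)^{1/2}\right].$$

Using $C(\mu_1, \mu_2, \mu_3) = B$ then enables us to solve for $\alpha$. As a last step, we substitute
that value of $\alpha$ in the expressions for the $\mu_k$. We find

$$\mu_k = \lambda_k + D\left(\frac{d_k}{c_k(1 - p_k)}\right)^{1/2}, \quad \text{for } k = 1, 2$$

$$\mu_3 = \lambda_1 + \lambda_2 + D\left(\frac{1}{c_3}\sum_{k=1,2}\frac{d_k}{1 - p_k}\right)^{1/2},$$

where the constant $D = \alpha^{-1/2}$ (not to be confused with the objective function) is

$$D = \frac{B - c_1\lambda_1 - c_2\lambda_2 - c_3(\lambda_1 + \lambda_2)}{\sum_{k=1,2}\left(\frac{d_k c_k}{1 - p_k}\right)^{1/2} + \left(c_3\sum_{k=1,2}\frac{d_k}{1 - p_k}\right)^{1/2}}.$$

These results show that, for $k = 1, 2$, the capacity $\mu_k$ increases with $d_k$, i.e.,
the cost of delays of packets of type $k$; it also decreases with $c_k$, i.e., the cost of
providing that capacity.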

A numerical solution can be obtained using a scipy optimization tool called 
minimize. Here is the code. 


import numpy as np
from scipy.optimize import minimize

d = [1, 2]        # delay cost coefficients
c = [2, 3, 4]     # capacity cost coefficients
l = [3, 2]        # rates: l[0] = lambda1, etc.
p = [0.1, 0.2]    # error probabilities
B = 60            # capacity budget
UB = 50           # upper bound on capacity
EPS = 1e-3        # margin that keeps each capacity strictly above its load

# x = mu1, mu2, mu3: x[0] = mu1, etc.

def objective(x):   # delay cost to minimize
    z = 0
    for k in range(2):
        z = z + (d[k]/(1 - p[k]))*(1/(x[k] - l[k])
                                   + 1/(x[2] - l[0] - l[1]))
    return z

def constraint(x):  # budget constraint: B - sum of c[k]*x[k] >= 0
    z = B
    for k in range(3):
        z = z - c[k]*x[k]
    return z

x0 = [5, 5, 10]                      # initial value for optimization
b0 = (l[0] + EPS, UB)                # lower and upper bound for x[0]
b1 = (l[1] + EPS, UB)                # lower and upper bound for x[1]
b2 = (l[0] + l[1] + EPS, UB)         # lower and upper bound for x[2]
bnds = (b0, b1, b2)                  # bounds for the three variables x

con = {'type': 'ineq', 'fun': constraint}   # specifies the constraint

sol = minimize(objective, x0, method='SLSQP',
               bounds=bnds, constraints=con)

# sol.x is the solution, sol.fun the minimal delay cost
print(sol)


The code produces an approximate solution. The advantage is that one does not 
need any analytical skills. The disadvantage is that one does not get any qualitative 
insight. 
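For comparison, the closed-form capacities can be evaluated directly. The sketch below reuses the parameter values of the listing above and reads the analytical solution as $\mu_k = \lambda_k + D\,(d_k/(c_k(1 - p_k)))^{1/2}$ for $k = 1, 2$ and $\mu_3 = \lambda_1 + \lambda_2 + D\,(\sum_k d_k/(1 - p_k)/c_3)^{1/2}$, with $D$ chosen so that the budget is spent exactly. The function names are illustrative.

```python
import math

d = [1, 2]          # delay cost coefficients (same values as the listing above)
c = [2, 3, 4]       # capacity cost coefficients
lam = [3, 2]        # lambda_1, lambda_2
p = [0.1, 0.2]      # error probabilities
B = 60              # capacity budget

def delay_cost(mu):
    """Objective D(mu1, mu2, mu3)."""
    slack3 = mu[2] - lam[0] - lam[1]
    return sum(d[k] / (1 - p[k]) * (1 / (mu[k] - lam[k]) + 1 / slack3)
               for k in range(2))

def capacity_cost(mu):
    return sum(c[k] * mu[k] for k in range(3))

def closed_form_capacities():
    w = [d[k] / (1 - p[k]) for k in range(2)]            # d_k / (1 - p_k)
    spare = B - c[0] * lam[0] - c[1] * lam[1] - c[2] * (lam[0] + lam[1])
    denom = (sum(math.sqrt(w[k] * c[k]) for k in range(2))
             + math.sqrt(c[2] * sum(w)))
    D = spare / denom                                     # equals alpha ** -0.5
    mu = [lam[k] + D * math.sqrt(w[k] / c[k]) for k in range(2)]
    mu.append(lam[0] + lam[1] + D * math.sqrt(sum(w) / c[2]))
    return mu

mu = closed_form_capacities()
print(mu, capacity_cost(mu), delay_cost(mu))
```

The capacity cost printed should equal the budget $B$, and the capacities should be close to those returned by the numerical optimization.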
5.9 Internet and Network of Queues


Can one model the internet as a network of queues? If so, does the result of 
the previous section really apply? Well, the mathematical answers are maybe and 
maybe. 

The internet transports packets (groups of bits) from node to node. The nodes 
are sources and destinations such as computers, webcams, smartphones, etc., and 
network nodes such as switches or routers. The packets go from buffer to buffer. 
These buffers look like queues. The service times are the transmission times of 
packets. The transmission time of a packet (in seconds) is the number of bits in 
the packet divided by the rate of the transmitter (in bits per second). The packets 
have random lengths, so the service times are random. So, the internet looks like a 
network of queues. However, there are some important ways in which our network 
of queues is not an exact model of the internet. First, the packet lengths are not 
exponentially distributed. Second, a packet keeps the same number of bits as it 
moves from one queue to the next. Thus, the service times of a given packet in the 
different queues are all proportional to each other. Third, the time between the arrivals
of two successive packets from a given node cannot be smaller than the transmission
time of the first packet. Thus, the arrival times and the service times in one queue 
are not independent and the times between arrivals are not exponentially distributed. 

The real question is whether the internet can be approximated by a network 
similar to that of the previous section. For instance, if we use that model, are we 
very far off when we try to estimate delays or queue lengths? Experiments suggest
that the approximation may be reasonable to a first order. One intuitive justification 
is the diversity of streams of packets. It goes as follows. Consider one specific queue 
in a large network node of the internet. This node is traversed by packets that come 
from many different sources and go to many destinations. Thus, successive packets 
that arrive at the queue may come from different previous nodes, which reduces 
the dependency of the arrivals and the service times. The service time distribution 
certainly affects the delays. However, the results obtained assuming an exponential 
distribution may provide a reasonable estimate. 


5.10 Product-Form Networks 


The example of the previous sections generalizes as follows. There are $N \geq 1$
queues and $C \geq 1$ classes of customers. At each queue $i$, customers of class
$c \in \{1, \ldots, C\}$ arrive with rate $\gamma_i^c$, independently of the past and of
other arrivals. Queue $i$ serves customers with rate $\mu_i$. When a customer of class $c$
completes service in queue $i$, it goes to queue $j$ and becomes a customer of class $d$
with probability $r_{i,j}^{c,d}$, for $i, j \in \{1, \ldots, N\}$ and $c, d \in \{1, \ldots, C\}$.
That customer leaves the network with probability

$$r_{i,0}^{c} = 1 - \sum_{j=1}^{N} \sum_{d=1}^{C} r_{i,j}^{c,d}.$$

That is, a customer of class $c$ who completes service in queue $i$ either goes to
another queue or leaves the network.

Define $\lambda_i^c$ as the average rate of customers of class $c$ that go through queue $i$,
for $i \in \{1, \ldots, N\}$ and for $c \in \{1, \ldots, C\}$. Assume that the rate of arrivals
of customers of a given class into a queue is equal to the rate of departures of those
customers from the queue. Then the rates $\lambda_i^c$ should satisfy the following flow
conservation equations:

$$\lambda_i^c = \gamma_i^c + \sum_{j=1}^{N} \sum_{d=1}^{C} \lambda_j^d \, r_{j,i}^{d,c}, \quad i \in \{1, \ldots, N\}, \; c \in \{1, \ldots, C\}.$$
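The flow conservation equations are linear in the unknown rates, so they can be solved numerically. The following is a minimal sketch, assuming NumPy is available; the function name `solve_flow_conservation`, the array layout of the routing probabilities, and the tandem example are illustrative assumptions, not part of the text.

```python
import numpy as np

def solve_flow_conservation(gamma, r):
    """Solve lambda_i^c = gamma_i^c + sum_{j,d} lambda_j^d * r[j, d, i, c].

    gamma: array of shape (N, C) with the external arrival rates.
    r:     array of shape (N, C, N, C); r[j, d, i, c] is the probability that
           a class-d customer leaving queue j joins queue i as class c.
    """
    N, C = gamma.shape
    R = r.reshape(N * C, N * C)      # flatten (queue, class) pairs
    g = gamma.reshape(N * C)
    # Row-vector form: lam = g + lam R, i.e. (I - R)^T lam^T = g^T.
    lam = np.linalg.solve((np.eye(N * C) - R).T, g)
    return lam.reshape(N, C)

# Hypothetical example: two queues in tandem, two classes that never change.
gamma = np.array([[1.0, 0.5],        # external arrivals at queue 0
                  [0.0, 0.0]])       # no external arrivals at queue 1
r = np.zeros((2, 2, 2, 2))
r[0, 0, 1, 0] = 1.0                  # class 0: queue 0 -> queue 1, then leaves
r[0, 1, 1, 1] = 1.0                  # class 1: queue 0 -> queue 1, then leaves
lam = solve_flow_conservation(gamma, r)
print(lam)
```

In this tandem example every customer visits both queues once, so the computed rate of each class is the same in both queues.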

Let also $X(t) = \{X_i(t), i = 1, \ldots, N\}$ where $X_i(t)$ is the configuration of queue
$i$ at time $t \geq 0$. That is, $X_i(t)$ is the list of customer classes in queue $i$, from the tail
of the queue to the head of the queue. For instance, $X_i(t) = 132312$ if the customer
at the tail of queue $i$ is of class 1, the customer in front of her is of class 3, and so
on, and the customer at the head of the queue and being served is of class 2. If the
queue is empty, then $X_i(t) = []$, where $[]$ designates the empty string.

One then has the following theorem. 


Theorem 5.6 (Product-Form Networks) 


(a) Let $\{\lambda_i^c, i = 1, \ldots, N; c = 1, \ldots, C\}$ be a solution of the flow conservation
equations. If $\lambda_i := \sum_{c=1}^{C} \lambda_i^c < \mu_i$ for $i = 1, \ldots, N$, then $X_t$ is a
Markov chain and its invariant distribution is given by

$$\pi(x) = A \prod_{i=1}^{N} g_i(x_i),$$

where

$$g_i(c_1 \cdots c_n) = \frac{\lambda_i^{c_1} \cdots \lambda_i^{c_n}}{\mu_i^n}$$

and $A$ is a constant such that $\pi$ sums to one over all the possible configurations
of the queues.
(b) If the network is open in that every customer can leave the network, then the
invariant distribution becomes

$$\pi(x) = \prod_{i=1}^{N} \pi_i(x_i),$$

where

$$\pi_i(c_1 \cdots c_n) = \left(1 - \frac{\lambda_i}{\mu_i}\right) \frac{\lambda_i^{c_1} \cdots \lambda_i^{c_n}}{\mu_i^n}.$$

In this case, under the invariant distribution, the queue lengths at time $t$ are
all independent, the length of queue $i$ has the same distribution as that of an
M/M/1 queue with arrival rate $\lambda_i$ and service rate $\mu_i$, and the customer classes
are all independent and are equal to $c$ with probability $\lambda_i^c / \lambda_i$.

Fig. 5.7 A network of queues


The proof of this theorem is the same as that of the particular example given in 
the next chapter. 


5.10.1 Example 


Figure 5.7 shows a network with two types of jobs. There is a single gray job that 
visits the two queues as shown. The white jobs go through the two queues once. The 
gray job models “hello” messages that the queues keep on exchanging to verify that 
the system is alive. For ease of notation, we assume that the service rates in the two 
queues are identical. 

We want to calculate the average time that the white jobs spend in the system 
and compare that value to the case when there is no gray job. That is, we want 
to understand the "cost" of using hello messages. The point of the example is to 
illustrate the methodology for networks where some customers never leave. The 
calculations show the following somewhat surprising result. 


Theorem 5.7 Using a hello message increases the expected delay of the white jobs 
by 50%. 


We prove the theorem in the next chapter. In that proof, we use Theorem 5.6 to 
calculate the invariant distribution of the system, derive the expected number L of 
white jobs in the network, then use Little's Law to calculate the average delay W of 
the white jobs as $W = L/\gamma$. We then compare that value to the case where there is
no gray job.


5.11 References 


The literature on social networks is vast and growing. The textbook Easley and 
Kleinberg (2012) contains many interesting models and results. The text Shah (2009)
studies the propagation of information in networks. 

The book Kelly (1979) is the most elegant presentation of the theory of queueing 
networks. It is readily available online. The excellent notes Kelly and Yudovina 
(2013) discuss recent results. The nice textbook Srikant and Ying (2014) explains 
network optimization and other performance evaluation problems. The books 
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Bremaud (2017) and Lyons and Perez (2017) are excellent sources for deeper studies 
of networks. The text Walrand (1988) is more clumsy but may be useful. 


5.12 Problems 


Problem 5.1 There are K users of a social network who collaborate to estimate 
some quantity by exchanging information. At each step, a pair (i, j) of users is 
selected uniformly at random and user j sends a message to user i with his estimate. 
User i then replaces his estimate by the average of his estimate and that of user j. 
Show that the estimates of all the users converge in probability to the average value 
of the initial estimates. This is an example of a consensus algorithm.

Hint Let $X_n(i)$ be the estimate of user $i$ at step $n$ and $X_n$ the vector with components
$X_n(i)$. Show that

$$E[X_{n+1}(i) \mid X_n] = (1 - \alpha) X_n(i) + \alpha A,$$

where $\alpha = 1/(2(K - 1))$ and $A = \sum_i X_0(i)/K$. Consequently,

$$E[|X_{n+1}(i) - A| \mid X_n] = (1 - \alpha) |X_n(i) - A|,$$

so that

$$E[|X_{n+1}(i) - A|] = (1 - \alpha) E[|X_n(i) - A|]$$

and

$$E[|X_n(i) - A|] \to 0.$$

Markov's inequality then shows that $P(|X_n(i) - A| > \epsilon) \to 0$ for any $\epsilon > 0$.
Problem 5.2 Jobs arrive at rate $\gamma$ in the system shown in Fig. 5.8. With probability
$p$, a job is sent to queue 1, independently of the other jobs; otherwise, the job
is sent to queue 2. For $i = 1, 2$, queue $i$ serves the jobs at rate $\mu_i$. Find the value
of $p$ that minimizes the average delay of jobs in the system. Compare the resulting
average delay to that of the system where the jobs are in one queue and join the
available server when they reach the head of the queue, and the fastest server if both
are idle, as shown in the bottom part of Fig. 5.8.

Hint The system of the top part of the figure is easy to analyze: with probability
$p$, a job faces the average delay $1/(\mu_1 - \gamma p)$ in the top queue and with probability
$1 - p$ the job faces the average delay $1/(\mu_2 - \gamma(1 - p))$. One then finds the value of
$p$ that minimizes the expected delay. For the system in the bottom part of the figure,
the state is $n$ with $n \geq 2$ when there are at least two jobs and the two servers are
busy, or $(1, s)$ where $s \in \{1, 2\}$ indicates which server is busy, or 0 when the system
is empty. One then needs to find the invariant distribution of the state, compute the
average number of jobs, and use Little's Law to find the average delay. The state
transition diagram is shown in Fig. 5.9.

Fig. 5.8 Optimizing p (top) versus joining the free server

Fig. 5.9 The state transition diagram. Here, $\mu := \mu_1 + \mu_2$


Problem 5.3 This problem compares parallel queues to a single queue. There are 
$N$ servers. Each server serves customers at rate $\mu$. The customers arrive at rate
$N\lambda$. In the first system, the customers are split into $N$ queues, one for each server.
Customers arrive at each queue with rate $\lambda$. The average delay is that of an M/M/1
queue, i.e., $1/(\mu - \lambda)$. In the second system, the customers join a single queue. The
customer at the head of the queue then goes to the next available server. Calculate
the average delay in this system. Write a Python program to plot the average delays
of the two systems as a function of $\rho := \lambda/\mu$, for different values of $N$.


Hint The state diagram is shown in Fig. 5.10. 


Problem 5.4 In this problem, we explore a system of parallel queues where the
customers join the shortest queue. Customers arrive at rate $N\lambda$ and there are $N$
queues, each with a server who serves customers at rate $\mu > \lambda$. When a customer
arrives, she joins the shortest queue. The goal is to analyze the expected delay in
the system. Unfortunately, this problem cannot be solved analytically. So, your task
is to write a Python program to evaluate the expected delay numerically. The first
step is to draw the state transition diagram. Approximate the system by discarding
customers who arrive when there are already $M$ customers in the system. The second
step is to write the balance equations. Finally, one writes a program to solve the
equations numerically.

Fig. 5.11 The system


Problem 5.5 Figure 5.11 shows a system of $N$ queues that serve jobs at rate $\mu$. If
there is a single job, it takes on average $N/\mu$ time units for it to go around the circle.
Thus, the average rate at which a job leaves a particular queue is $\mu/N$. Show that
when there are two jobs, this rate is $2\mu/(N + 1)$.
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Application: Social Networks, Communication Networks 
Topics: Continuous-Time Markov Chains, Product-Form Queueing Networks 


6.1 Social Networks 
We provide the proofs of the theorems in Sect. 5.1. 


Theorem 6.1 (Spreading of a Message) Let $Z$ be the number of nodes that
eventually receive the message.


(a) If $\mu < 1$, then $P(Z < \infty) = 1$ and $E(Z) < \infty$;
(b) If $\mu > 1$, then $P(Z = \infty) > 0$.


Proof For part (a), let $X_n$ be the number of nodes that are $n$ steps from the root. If
$X_n = k$, we can write $X_{n+1} = Y_1 + \cdots + Y_k$ where $Y_j$ is the number of children of
node $j$ at level $n$. By assumption, $E(Y_j) = \mu$ for all $j$. Hence,

$$E[X_{n+1} \mid X_n = k] = E(Y_1 + \cdots + Y_k) = \mu k.$$

Hence, $E[X_{n+1} \mid X_n] = \mu X_n$. Taking expectations shows that $E(X_{n+1}) = \mu E(X_n)$,
$n \geq 0$. Consequently,

$$E(X_n) = \mu^n, \quad n \geq 0.$$

Now, the sequence $Z_n = X_0 + \cdots + X_n$ is nonnegative and increases to
$Z = \sum_{n=0}^{\infty} X_n$. By MCT, it follows that $E(Z_n) \to E(Z)$. But

$$E(Z_n) = \mu^0 + \cdots + \mu^n = \frac{1 - \mu^{n+1}}{1 - \mu}.$$

Hence, $E(Z) = 1/(1 - \mu) < \infty$. Consequently, $P(Z < \infty) = 1$.

For part (b), one first observes that the theorem does not state that $P(Z = \infty) = 1$.
For instance, assume that each node has three children with probability 0.5 and
has no child otherwise. Then $\mu = 1.5 > 1$ and $P(Z = 1) = P(X_1 = 0) = 0.5$, so
that $P(Z = \infty) \leq 0.5 < 1$. We define $X_n$, $Y_j$, and $Z_n$ as in the proof of part (a).

Let $\alpha_n = P(X_n > 0)$. Consider the $X_1$ children of the root. Since $\alpha_{n+1}$ is the
probability that there is one survivor after $n + 1$ generations, it is the probability that
at least one of the $X_1$ children of the root has a survivor after $n$ generations. Hence,

$$1 - \alpha_{n+1} = E((1 - \alpha_n)^{X_1}), \quad n \geq 0.$$

Indeed, if $X_1 = k$, the probability that none of the $k$ children of the root has a
survivor after $n$ generations is $(1 - \alpha_n)^k$. Hence,

$$\alpha_{n+1} = 1 - E((1 - \alpha_n)^{X_1}) =: g(\alpha_n), \quad n \geq 0.$$

Also, $\alpha_0 = 1$. As $n \to \infty$, one has $\alpha_n \to \alpha^* = P(X_n > 0, \text{ for all } n)$. Figure 6.1
shows that $\alpha^* > 0$. The key observations are that

$$g(0) = 0$$
$$g(1) = P(X_1 > 0) < 1$$
$$g'(0) = E(X_1 (1 - \alpha)^{X_1 - 1}) \big|_{\alpha = 0} = \mu > 1$$
$$g'(1) = E(X_1 (1 - \alpha)^{X_1 - 1}) \big|_{\alpha = 1} \geq 0,$$

so that the figure is as drawn. □
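The recursion for the survival probabilities can be iterated numerically. The sketch below assumes the offspring distribution of the example in the proof (three children with probability 0.5, none otherwise); the number of iterations is an arbitrary choice.

```python
def g(alpha):
    # Offspring law assumed from the example: 3 children w.p. 0.5, else none.
    # g(alpha) = 1 - E[(1 - alpha)^{X_1}]
    return 1.0 - (0.5 + 0.5 * (1.0 - alpha) ** 3)

alpha = 1.0                  # alpha_0 = P(X_0 > 0) = 1
for _ in range(200):         # alpha_{n+1} = g(alpha_n)
    alpha = g(alpha)
print(alpha)                 # numerical approximation of alpha*
```

For this offspring law the limit solves $\alpha = g(\alpha)$ with $\alpha > 0$, which gives $\alpha^* = (3 - \sqrt{5})/2 \approx 0.38$, consistent with $P(Z = \infty) \leq 0.5$.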


Theorem 6.2 (Cascades) Assume $p_k = p \in (0, 1]$ for all $k \geq 1$. Then, all nodes
turn red with probability at least equal to $\theta$ where

$$\theta = \exp\left\{-\frac{1 - p}{p}\right\}.$$


Proof The probability that node $n$ does not listen to anyone is $a_n = (1 - p)^n$. Let
$X$ be the index of the first node that does not listen to anyone. Then

$$P(X > n) = (1 - a_1)(1 - a_2) \cdots (1 - a_n) \leq \exp\{-a_1 - \cdots - a_n\} = \exp\left\{-\frac{(1 - p) - (1 - p)^{n+1}}{p}\right\}.$$

Now,

$$P(X = \infty) = \lim_{n} P(X > n) \geq \exp\left\{-\frac{1 - p}{p}\right\} = \theta.$$

Thus, with probability at least $\theta$, every node listens to at least one previous node.
When that is the case, all the nodes turn red. To see this, assume that $n$ is the first
blue node. That is not possible since it listened to some previous nodes that are all
red. □

Fig. 6.1 The proof that $\alpha^* > 0$


6.2 Continuous-Time Markov Chains 


Our goal is to understand networks where packets travel from node to node until 
they reach their destination. In particular, we want to study the delay of packets 
from source to destination and the backlog in the nodes. 

It turns out that the analysis of such systems is much easier in continuous time 
than in discrete time. To carry out such analysis, we have to introduce continuous- 
time Markov chains. We do this on a few simple examples. 
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6.2.1 Two-State Markov Chain 


Figure 6.2 illustrates a random process $\{X_t, t \geq 0\}$ that takes values in $\{0, 1\}$. A
random process is a collection of random variables indexed by $t \geq 0$. Saying that
such a random process is defined means that one can calculate the probability that

$$\{X_{t_1} = x_1, X_{t_2} = x_2, \ldots, X_{t_n} = x_n\}$$

for any value of $n \geq 1$, any $0 < t_1 < \cdots < t_n$, and $x_1, \ldots, x_n \in \{0, 1\}$. We explain
below how one could calculate such a probability.

We call $X_t$ the state of the process at time $t$. The possible values $\{0, 1\}$ are also
called states. The state $X_t$ evolves according to rules characterized by two positive
numbers $\lambda$ and $\mu$. As Fig. 6.2 shows, if $X_0 = 0$, the state remains equal to zero
for a random time $T_0$ that is exponentially distributed with parameter $\lambda$, thus with
mean $1/\lambda$. The state $X_t$ then jumps to 1 where it stays for a random time $T_1$ that is
exponentially distributed with rate $\mu$, independent of $T_0$, and so on. The definition is
similar if $X_0 = 1$. In that case, $X_t$ keeps the value 1 for an exponentially distributed
time with rate $\mu$, then jumps to 0, etc.

Thus, the pdf of $T_0$ is

$$f_{T_0}(t) = \lambda \exp\{-\lambda t\} 1\{t \geq 0\}.$$

In particular,

$$P(T_0 \leq \epsilon) \approx f_{T_0}(0)\epsilon = \lambda \epsilon, \quad \text{for } \epsilon \ll 1.$$

Throughout this chapter, the symbol $\approx$ means "up to a quantity negligible compared
to $\epsilon$." It is shown in Theorem 15.3 that an exponentially distributed random variable is
memoryless. That is,

$$P[T_0 > t + s \mid T_0 > t] = P(T_0 > s), \quad s, t \geq 0.$$


The memoryless property and the independence of the exponential times $T_k$
imply that $\{X_t, t \geq 0\}$ starts afresh from $X_s$ at time $s$. Figure 6.3 illustrates that
property. Mathematically, it says that given $\{X_t, t \leq s\}$ with $X_s = k$, the process
$\{X_{s+t}, t \geq 0\}$ has the same properties as $\{X_t, t \geq 0\}$ given that $X_0 = k$, for $k = 0, 1$
and for any $s \geq 0$. Indeed, if $X_s = 0$, then the residual time that $X_t$ remains in 0
is exponentially distributed with rate $\lambda$ and is independent of what happened before
time $s$, because the time in 0 is memoryless and independent of the previous times
in 0 and 1. This property is written as

$$P[\{X_{s+t}, t \geq 0\} \in A \mid X_s = k; X_t, t \leq s] = P[\{X_t, t \geq 0\} \in A \mid X_0 = k],$$

for $k = 0, 1$, for all $s \geq 0$, and for all sets $A$ of possible trajectories. A generic set
$A$ of trajectories is

$$A = \{(x_t, t \geq 0) \in C_+ \mid x_{t_1} = i_1, \ldots, x_{t_n} = i_n\}$$

for given $0 < t_1 < \cdots < t_n$ and $i_1, \ldots, i_n \in \{0, 1\}$. Here, $C_+$ is the set of right-continuous
functions of $t \geq 0$ that take values in $\{0, 1\}$.

Fig. 6.2 A random process on $\{0, 1\}$

Fig. 6.3 The process $\{X_t, t \geq 0\}$ starts afresh from $X_s$ at time $s$

This property is the continuous-time version of the Markov property for Markov
chains. One says that the process $X_t$ satisfies the Markov property and one calls
$\{X_t, t \geq 0\}$ a continuous-time Markov chain (CTMC).

For instance,

$$P[X_{s+2.5} = 1, X_{s+4} = 0, X_{s+5.1} = 0 \mid X_s = 0; X_t, t \leq s] = P[X_{2.5} = 1, X_4 = 0, X_{5.1} = 0 \mid X_0 = 0].$$

The Markov property generalizes to situations where $s$ is replaced by a random time
$\tau$ that is defined by a causal rule, i.e., a rule that does not look ahead. For instance,
as in Fig. 6.4, $\tau$ can be the second time that $X_t$ visits state 0. Or $\tau$ could be the
first time that it visits state 0 after having spent at least 3 time units in state 1. The
property does not extend to non-causal times such as one time unit before $X_t$ visits
state 1. Random times $\tau$ defined by causal rules are called stopping times. This more
general property is called the strong Markov property. To prove this property, one
conditions on the value $s$ of $\tau$ and uses the fact that the future evolution does not
depend on this value since the event $\{\tau = s\}$ depends only on $\{X_t, t \leq s\}$.

For $0 < \epsilon \ll 1$ one has

$$P[X_{t+\epsilon} = 1 \mid X_t = 0] \approx \lambda \epsilon.$$

Indeed, the process jumps from 0 to 1 in $\epsilon$ time units if the exponential time in 0 is
less than $\epsilon$, which has probability approximately $\lambda \epsilon$.
Similarly,

$$P[X_{t+\epsilon} = 0 \mid X_t = 1] \approx \mu \epsilon.$$

Fig. 6.4 The process $X_t$ starts afresh from $X_\tau$ at the stopping time $\tau$

Fig. 6.5 The state transition diagram


We say that the transition rate from 0 to 1 is equal to $\lambda$ and that from 1 to 0 is equal
to $\mu$ to indicate that the probability of a transition from 0 to 1 in $\epsilon$ units of time is
approximately $\lambda \epsilon$ and that from 1 to 0 is approximately $\mu \epsilon$.
Figure 6.5 illustrates these transition rates. This figure is called the state
transition diagram.
The previous two identities imply that

$$P(X_{t+\epsilon} = 1) = P(X_t = 0, X_{t+\epsilon} = 1) + P(X_t = 1, X_{t+\epsilon} = 1)$$
$$= P(X_t = 0) P[X_{t+\epsilon} = 1 \mid X_t = 0] + P(X_t = 1) P[X_{t+\epsilon} = 1 \mid X_t = 1]$$
$$\approx P(X_t = 0) \lambda \epsilon + P(X_t = 1)(1 - P[X_{t+\epsilon} = 0 \mid X_t = 1])$$
$$\approx P(X_t = 0) \lambda \epsilon + P(X_t = 1)(1 - \mu \epsilon).$$

Also, similarly, one finds that

$$P(X_{t+\epsilon} = 0) \approx P(X_t = 0)(1 - \lambda \epsilon) + P(X_t = 1) \mu \epsilon.$$


We can write these identities in a convenient matrix notation as follows. For
$t \geq 0$, one defines the row vector $\pi_t$ as

$$\pi_t = [P(X_t = 0), P(X_t = 1)].$$

One also defines the transition rate matrix $Q$ as follows:

$$Q = \begin{bmatrix} -\lambda & \lambda \\ \mu & -\mu \end{bmatrix}.$$

With that notation, the previous identities can be written as

$$\pi_{t+\epsilon} \approx \pi_t (I + Q\epsilon),$$

where $I$ is the identity matrix. Subtracting $\pi_t$ from both sides, dividing by $\epsilon$, and
letting $\epsilon \to 0$, we find

$$\frac{d}{dt} \pi_t = \pi_t Q. \tag{6.1}$$

By analogy with the scalar equation $dx_t/dt = a x_t$ whose solution is $x_t = x_0 \exp\{at\}$,
we conclude that

$$\pi_t = \pi_0 \exp\{Qt\}, \tag{6.2}$$

where

$$\exp\{Qt\} := I + Qt + \frac{1}{2!} Q^2 t^2 + \frac{1}{3!} Q^3 t^3 + \cdots.$$

Note that

$$\frac{d}{dt} \exp\{Qt\} = 0 + Q + Q^2 t + \frac{1}{2!} Q^3 t^2 + \cdots = Q \exp\{Qt\}.$$

Observe also that $\pi_t = \pi$ for all $t \geq 0$ if and only if $\pi_0 = \pi$ and

$$\pi Q = 0. \tag{6.3}$$

Indeed, if $\pi_t = \pi$ for all $t$, then (6.1) implies that $0 = \frac{d}{dt} \pi_t = \pi_t Q = \pi Q$.
Conversely, if $\pi_0 = \pi$ with $\pi Q = 0$, then

$$\pi_t = \pi_0 \exp\{Qt\} = \pi \exp\{Qt\} = \pi \left(I + Qt + \frac{1}{2!} Q^2 t^2 + \frac{1}{3!} Q^3 t^3 + \cdots\right) = \pi.$$
These equations $\pi Q = 0$ are called the balance equations. They are

$$[\pi(0), \pi(1)] \begin{bmatrix} -\lambda & \lambda \\ \mu & -\mu \end{bmatrix} = 0,$$

i.e.,

$$\pi(0)(-\lambda) + \pi(1)\mu = 0$$
$$\pi(0)\lambda - \pi(1)\mu = 0.$$

These two equations are identical. To determine $\pi$, we use the fact that $\pi(0) + \pi(1) = 1$.
Combined with the previous identity, we find

$$[\pi(0), \pi(1)] = \left[\frac{\mu}{\lambda + \mu}, \frac{\lambda}{\lambda + \mu}\right].$$

Fig. 6.6 A discrete-time approximation of $X_t$

The identity $\pi_{t+\epsilon} \approx \pi_t (I + Q\epsilon)$ shows that one can view $\{X_{n\epsilon}, n = 0, 1, \ldots\}$
as a discrete-time Markov chain with transition matrix $P = I + Q\epsilon$. Figure 6.6
shows the transition diagram that corresponds to this transition matrix. The invariant
distribution for $P$ is such that $\pi P = \pi$, i.e., $\pi (I + Q\epsilon) = \pi$, so that $\pi Q = 0$, not
surprisingly.

Note that this discrete-time Markov chain is aperiodic because states have self-loops.
Thus, we expect that

$$\pi_{n\epsilon} \to \pi, \quad \text{as } n \to \infty.$$

Consequently, we expect that, in continuous time,

$$\pi_t \to \pi, \quad \text{as } t \to \infty.$$
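The discrete-time approximation can be iterated directly to watch the distribution converge. This is a minimal sketch, assuming NumPy; the rates, the step size, and the time horizon are illustrative values, not taken from the text.

```python
import numpy as np

lam, mu = 2.0, 3.0                       # hypothetical transition rates
Q = np.array([[-lam, lam],
              [mu, -mu]])
eps = 1e-3                               # time step, eps << 1
P = np.eye(2) + Q * eps                  # transition matrix of {X_{n eps}}

pi_t = np.array([1.0, 0.0])              # start in state 0
for _ in range(int(5.0 / eps)):          # advance to t = 5
    pi_t = pi_t @ P

pi = np.array([mu, lam]) / (lam + mu)    # solution of the balance equations
print(pi_t, pi)
```

With these values the two printed vectors agree to many decimal places, which illustrates the convergence of the distribution to the solution of the balance equations.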


6.2.2 Three-State Markov Chain 


The previous Markov chain alternates between the states 0 and 1. More general 
Markov chains visit states in a random order. We explain that feature in our next 
example with 3 states. Fortunately, this example suffices to illustrate the general 
case. We do not have to look at Markov chains with 4, 5,... states to describe the 
general model. 
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Fig. 6.7 A three-state 
Markov chain 




In the example shown in Fig. 6.7, the rules of evolution are characterized
by positive numbers $q(0, 1)$, $q(0, 2)$, $q(1, 2)$, and $q(2, 0)$. One also defines
$q_0$, $q_1$, $q_2$, $\Gamma(0, 1)$, and $\Gamma(0, 2)$ as in the figure.

If $X_0 = 0$, the state $X_t$ remains equal to 0 for some random time $T_0$ that
is exponentially distributed with rate $q_0$. At time $T_0$, the state jumps to 1 with
probability $\Gamma(0, 1)$ or to state 2 otherwise, with probability $\Gamma(0, 2)$. If $X_t$ jumps
to 1, it stays there for an exponentially distributed time $T_1$ with rate $q_1$ that is
independent of $T_0$. More generally, when $X_t$ enters state $k$, it stays there for a
random time that is exponentially distributed with rate $q_k$ that is independent of the
past evolution. From this definition, it should be clear that the process $X_t$ satisfies
the Markov property.

Define $\pi_t = [\pi_t(0), \pi_t(1), \pi_t(2)]$ where $\pi_t(k) = P(X_t = k)$ for $k = 0, 1, 2$.
One has, for $0 < \epsilon \ll 1$,

$$P[X_{t+\epsilon} = 1 \mid X_t = 0] \approx q_0 \epsilon \, \Gamma(0, 1) = q(0, 1)\epsilon.$$

Indeed, the process jumps from 0 to 1 in $\epsilon$ time units if the exponential time with
rate $q_0$ is less than $\epsilon$ and if the process then jumps to 1 instead of jumping to 2.
Similarly,

$$P[X_{t+\epsilon} = 2 \mid X_t = 0] \approx q_0 \epsilon \, \Gamma(0, 2) = q(0, 2)\epsilon.$$

Also,

$$P[X_{t+\epsilon} = 1 \mid X_t = 1] \approx 1 - q_1 \epsilon,$$

since this is approximately the probability that the exponential time with rate $q_1$ is
larger than $\epsilon$. Moreover,

$$P[X_{t+\epsilon} = 1 \mid X_t = 2] \approx 0,$$

because the probability that both the exponential time with rate $q_2$ in state 2 and the
exponential time with rate $q_0$ in state 0 are less than $\epsilon$ is roughly $(q_2 \epsilon) \times (q_0 \epsilon)$, and
this is negligible compared to $\epsilon$.

These observations imply that

$$\pi_{t+\epsilon}(1) = P(X_t = 0, X_{t+\epsilon} = 1) + P(X_t = 1, X_{t+\epsilon} = 1) + P(X_t = 2, X_{t+\epsilon} = 1)$$
$$= P(X_t = 0) P[X_{t+\epsilon} = 1 \mid X_t = 0] + P(X_t = 1) P[X_{t+\epsilon} = 1 \mid X_t = 1] + P(X_t = 2) P[X_{t+\epsilon} = 1 \mid X_t = 2]$$
$$\approx \pi_t(0) q(0, 1)\epsilon + \pi_t(1)(1 - q_1 \epsilon).$$

Proceeding in a similar way shows that

$$\pi_{t+\epsilon}(0) \approx \pi_t(0)(1 - q_0 \epsilon) + \pi_t(2) q(2, 0)\epsilon$$
$$\pi_{t+\epsilon}(2) \approx \pi_t(0) q(0, 2)\epsilon + \pi_t(1) q(1, 2)\epsilon + \pi_t(2)(1 - q_2 \epsilon).$$


Similarly to the two-state example, let us define the rate matrix $Q$ as follows:

$$Q = \begin{bmatrix} -q_0 & q(0, 1) & q(0, 2) \\ 0 & -q_1 & q(1, 2) \\ q(2, 0) & 0 & -q_2 \end{bmatrix}.$$

The previous identities can then be written as follows:

$$\pi_{t+\epsilon} \approx \pi_t [I + Q\epsilon].$$

Subtracting $\pi_t$ from both sides, dividing by $\epsilon$, and letting $\epsilon \to 0$ then shows that

$$\frac{d}{dt} \pi_t = \pi_t Q.$$

As before, the solution of this equation is

$$\pi_t = \pi_0 \exp\{Qt\}, \quad t \geq 0.$$

The distribution $\pi$ is invariant if and only if

$$\pi Q = 0.$$

Once again, we note that $\{X_{n\epsilon}, n = 0, 1, \ldots\}$ is approximately a discrete-time
Markov chain with transition matrix $P = I + Q\epsilon$ shown in Fig. 6.8. This Markov
chain is aperiodic, and we conclude that

$$P(X_{n\epsilon} = k) \to \pi(k), \quad \text{as } n \to \infty.$$

Thus, we can expect that

$$\pi_t \to \pi, \quad \text{as } t \to \infty.$$

Also, since $X_{n\epsilon}$ is irreducible, the long-term fraction of time that it spends in the
different states converges to $\pi$, and we can then expect the same for $X_t$.

Fig. 6.8 The transition matrix of the discrete-time approximation
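The balance equations of the three-state example form a small linear system. The sketch below solves them with NumPy; the numerical rates and the helper name `invariant_distribution` are assumptions chosen for illustration.

```python
import numpy as np

def invariant_distribution(Q):
    """Solve pi Q = 0 together with sum(pi) = 1 by least squares."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])     # n balance equations + normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical rates: q(0,1) = 1, q(0,2) = 2, q(1,2) = 3, q(2,0) = 4.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.0, -3.0, 3.0],
              [4.0, 0.0, -4.0]])
pi = invariant_distribution(Q)
print(pi)
```

For these rates the balance equations give $\pi = [0.48, 0.16, 0.36]$, which can be checked by hand.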


6.2.3 General Case 


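Let $\mathcal{X}$ be a countable or finite set. The process $\{X_t, t \geq 0\}$ is defined as
follows. One is given a probability distribution $\pi$ on $\mathcal{X}$ and a rate matrix
$Q = \{q(i, j), i, j \in \mathcal{X}\}$.

By definition, $Q$ is such that

$$q(i, j) \geq 0, \; \forall i \neq j \quad \text{and} \quad \sum_j q(i, j) = 0, \; \forall i.$$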
j 


Definition 6.1 (Continuous-Time Markov Chain) A continuous-time Markov
chain with initial distribution $\pi$ and rate matrix $Q$ is a process $\{X_t, t \geq 0\}$ such
that $P(X_0 = i) = \pi(i)$. Also,

$$P[X_{t+\epsilon} = j \mid X_t = i, X_u, u < t] = 1\{i = j\} + \epsilon q(i, j) + o(\epsilon).$$

Fig. 6.9 Construction of a continuous-time Markov chain

This definition means that the process jumps from $i$ to $j \neq i$ with probability
$q(i, j)\epsilon$ in $\epsilon \ll 1$ time units. Thus, $q(i, j)$ is the probability of jumping from $i$ to $j$,
per unit of time. Note that the sum of these expressions over all $j$ gives 1, as should
be.

One construction of this process is as follows. Say that $X_t = i$. One then chooses
a random time $\tau$ that is exponentially distributed with rate $q_i := -q(i, i)$. At time
$t + \tau$, the process jumps and goes to state $j$ with probability $\Gamma(i, j) = q(i, j)/q_i$
for $j \neq i$ (Fig. 6.9).
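The construction translates almost line by line into a simulation. The following is a sketch under stated assumptions: NumPy's random generator, a hypothetical three-state rate matrix, and an arbitrary horizon and seed. The function name `simulate_time_fractions` is illustrative.

```python
import numpy as np

def simulate_time_fractions(Q, x0, horizon, rng):
    """Simulate a CTMC with exponential holding times and jump probabilities."""
    n = Q.shape[0]
    time_in_state = np.zeros(n)
    x, t = x0, 0.0
    while t < horizon:
        q_x = -Q[x, x]                       # rate of leaving state x
        hold = rng.exponential(1.0 / q_x)    # holding time with mean 1/q_x
        hold = min(hold, horizon - t)        # truncate at the horizon
        time_in_state[x] += hold
        t += hold
        jump = Q[x].copy()
        jump[x] = 0.0
        x = rng.choice(n, p=jump / q_x)      # Gamma(x, j) = q(x, j) / q_x
    return time_in_state / horizon

# Hypothetical rates: q(0,1) = 1, q(0,2) = 2, q(1,2) = 3, q(2,0) = 4.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.0, -3.0, 3.0],
              [4.0, 0.0, -4.0]])
fractions = simulate_time_fractions(Q, 0, 5000.0, np.random.default_rng(0))
print(fractions)
```

The empirical fractions of time should be close to the solution of $\pi Q = 0$ for this matrix, namely $[0.48, 0.16, 0.36]$, up to simulation noise.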

Thus, if $X_t = i$, the probability that $X_{t+\epsilon} = j$ is the probability that the process
jumps in $(t, t + \epsilon)$, which is $q_i \epsilon$, times the probability that it then jumps to $j$, which
is $\Gamma(i, j)$. Hence,

$$P[X_{t+\epsilon} = j \mid X_t = i] \approx q_i \epsilon \, \frac{q(i, j)}{q_i} = q(i, j)\epsilon,$$

up to $o(\epsilon)$. Thus, the construction yields the correct transition probabilities.
As we observed in the examples,

$$\frac{d}{dt} \pi_t = \pi_t Q,$$

so that

$$\pi_t = \pi_0 \exp\{Qt\}.$$

Moreover, a distribution $\pi$ is invariant if and only if it solves the balance equations

$$0 = \pi Q.$$

These equations, state by state, say that

$$\pi(i) q_i = \sum_{j \neq i} \pi(j) q(j, i), \quad \forall i \in \mathcal{X}.$$

These equations express the equality of the rate of leaving a state and the rate of
entering that state.
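Define

$$P_t(i, j) = P[X_{s+t} = j \mid X_s = i], \quad \text{for } i, j \in \mathcal{X} \text{ and } s, t \geq 0.$$

The Markov property implies that

$$P(X_{t_1} = i_1, \ldots, X_{t_n} = i_n) = P(X_{t_1} = i_1) P_{t_2 - t_1}(i_1, i_2) P_{t_3 - t_2}(i_2, i_3) \cdots P_{t_n - t_{n-1}}(i_{n-1}, i_n),$$

for all $i_1, \ldots, i_n \in \mathcal{X}$ and all $0 < t_1 < \cdots < t_n$.
Moreover, this identity implies the Markov property. Indeed, if it holds, one has

$$P[X_{t_m} = i_m, \ldots, X_{t_n} = i_n \mid X_{t_1} = i_1, \ldots, X_{t_{m-1}} = i_{m-1}]$$
$$= \frac{P(X_{t_1} = i_1, \ldots, X_{t_n} = i_n)}{P(X_{t_1} = i_1, \ldots, X_{t_{m-1}} = i_{m-1})}$$
$$= \frac{P(X_{t_1} = i_1) P_{t_2 - t_1}(i_1, i_2) P_{t_3 - t_2}(i_2, i_3) \cdots P_{t_n - t_{n-1}}(i_{n-1}, i_n)}{P(X_{t_1} = i_1) P_{t_2 - t_1}(i_1, i_2) P_{t_3 - t_2}(i_2, i_3) \cdots P_{t_{m-1} - t_{m-2}}(i_{m-2}, i_{m-1})}$$
$$= P_{t_m - t_{m-1}}(i_{m-1}, i_m) \cdots P_{t_n - t_{n-1}}(i_{n-1}, i_n).$$

Hence,

$$P[X_{t_m} = i_m, \ldots, X_{t_n} = i_n \mid X_{t_1} = i_1, \ldots, X_{t_{m-1}} = i_{m-1}]$$
$$= \frac{P(X_{t_{m-1}} = i_{m-1}) P_{t_m - t_{m-1}}(i_{m-1}, i_m) \cdots P_{t_n - t_{n-1}}(i_{n-1}, i_n)}{P(X_{t_{m-1}} = i_{m-1})}$$
$$= \frac{P(X_{t_{m-1}} = i_{m-1}, \ldots, X_{t_n} = i_n)}{P(X_{t_{m-1}} = i_{m-1})}$$
$$= P[X_{t_m} = i_m, \ldots, X_{t_n} = i_n \mid X_{t_{m-1}} = i_{m-1}].$$

If $X_t$ has the invariant distribution, one has

$$P(X_{t_1} = i_1, \ldots, X_{t_n} = i_n) = \pi(i_1) P_{t_2 - t_1}(i_1, i_2) P_{t_3 - t_2}(i_2, i_3) \cdots P_{t_n - t_{n-1}}(i_{n-1}, i_n),$$

for all $i_1, \ldots, i_n \in \mathcal{X}$ and all $0 < t_1 < \cdots < t_n$.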
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Here is the result that corresponds to Theorem 15.1. We define irreducibility, 
transience, and null and positive recurrence as in discrete time. There is no notion 
of periodicity in continuous time. 


Theorem 6.1 (Big Theorem for Continuous-Time Markov Chains) 
Consider a continuous-time Markov chain. 


(a) If the Markov chain is irreducible, the states are either all transient, all positive 
recurrent, or all null recurrent. We then say that the Markov chain is transient, 
positive recurrent, or null recurrent, respectively. 

(b) If the Markov chain is positive recurrent, it has a unique invariant distribution
$\pi$ and $\pi(i)$ is the long-term fraction of time that $X_t$ is equal to $i$. Moreover, the
probability $\pi_t(i)$ that the Markov chain $X_t$ is in state $i$ converges to $\pi(i)$.

(c) If the Markov chain is not positive recurrent, it does not have an invariant 
distribution and the fraction of time that it spends in any state goes to zero. 


6.2.4 Uniformization 


We saw earlier that a CTMC can be approximated by a discrete-time Markov chain
that has a time step $\epsilon \ll 1$. There are two other DTMCs that have a close relationship
with the CTMC: the jump chain and the uniformized chain. We explain these chains
for the CTMC $X_t$ in Fig. 6.7.

The jump chain is $X_t$ observed when it jumps. As Fig. 6.7 shows, this DTMC
has a transition matrix equal to $\Gamma$ where

$$\Gamma(i, j) = \begin{cases} q(i, j)/q_i, & \text{if } i \neq j \\ 0, & \text{if } i = j. \end{cases}$$

Let $\nu$ be the invariant distribution of this jump chain. That is, $\nu = \nu \Gamma$. Since $\nu(i)$ is
the long-term fraction of time that the jump chain is in state $i$, and since the CTMC
$X_t$ spends an average time $1/q_i$ in state $i$ whenever it visits that state, the fraction of
time that $X_t$ spends in state $i$ should be proportional to $\nu(i)/q_i$. That is, one expects

$$\pi(i) = A \nu(i)/q_i$$

for some constant $A$. That is, one should have

$$\sum_j [\nu(j)/q_j] \, q(j, i) = 0.$$

To verify that equality, we observe that

$$\sum_j [\nu(j)/q_j] \, q(j, i) = \sum_{j \neq i} \nu(j) \Gamma(j, i) + \nu(i) q(i, i)/q_i = \nu(i) - \nu(i) = 0.$$

We used the fact that $\nu \Gamma = \nu$ and $q(i, i) = -q_i$.
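The relation between the jump chain and the CTMC can be checked numerically. The sketch below assumes NumPy and a hypothetical three-state rate matrix; it computes $\nu$ from $\nu = \nu\Gamma$, forms the vector proportional to $\nu(i)/q_i$, and verifies that it solves the balance equations.

```python
import numpy as np

# Hypothetical rates: q(0,1) = 1, q(0,2) = 2, q(1,2) = 3, q(2,0) = 4.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.0, -3.0, 3.0],
              [4.0, 0.0, -4.0]])
q = -np.diag(Q)                          # q_i = -q(i, i)
Gamma = Q / q[:, None]                   # divide row i by q_i
np.fill_diagonal(Gamma, 0.0)             # Gamma(i, i) = 0

# Invariant distribution nu of the jump chain: nu = nu Gamma, sum(nu) = 1.
n = len(q)
A = np.vstack([(Gamma - np.eye(n)).T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
nu, *_ = np.linalg.lstsq(A, b, rcond=None)

pi = (nu / q) / np.sum(nu / q)           # pi(i) proportional to nu(i) / q_i
print(nu, pi, pi @ Q)
```

Note that $\nu$ and $\pi$ differ: states with a short mean holding time $1/q_i$ are visited often by the jump chain but occupy less time.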

The uniformized chain is not the jump chain. It is a discrete-time Markov chain
obtained from the CTMC as follows. Let $\lambda \geq q_i$ for all $i$. The rate at which $X_t$
changes state is $q_i$ when it is in state $i$. Let us add a dummy jump from $i$ to $i$ with
rate $\lambda - q_i$. The rate of jumps, including these dummy jumps, of this new Markov
chain $Y_t$ is now constant and equal to $\lambda$.

The transition matrix $P$ of $Y_t$ is such that

$$P(i, j) = \begin{cases} (\lambda - q_i)/\lambda, & \text{if } i = j \\ q(i, j)/\lambda, & \text{if } i \neq j. \end{cases}$$

To see this, assume that $Y_t = i$. The next jump will occur with rate $\lambda$. With
probability $(\lambda - q_i)/\lambda$, it is a dummy jump from $i$ to $i$. With probability $q_i/\lambda$ it
is an actual jump where $Y_t$ jumps to $j \neq i$ with probability $\Gamma(i, j)$. Hence, $Y_t$
jumps from $i$ to $i$ with probability $(\lambda - q_i)/\lambda$ and from $i$ to $j \neq i$ with probability
$(q_i/\lambda)\Gamma(i, j) = q(i, j)/\lambda$.

Note that

$$P = I + \frac{1}{\lambda} Q,$$

where $I$ is the identity matrix.

Now, define $Z_n$ to be the jump chain of $Y_t$, i.e., the Markov chain with transition
matrix $P$. Since the jumps of $Y_t$ occur at rate $\lambda$, independently of the value of the
state $Y_t$, we can simulate $Y_t$ as follows. Let $N_t$ be a Poisson process with rate $\lambda$. The
jump times $\{t_1, t_2, \ldots\}$ of $N_t$ will be the jump times of $Y_t$. The successive values of
$Y_t$ are those of $Z_n$. Formally,

$$Y_t = Z_{N_t}.$$

That is, if $N_t = n$, then we define $Y_t = Z_n$. Since the CTMC $Y_t$ spends $1/\lambda$ on
average between jumps, the invariant distribution of $Y_t$ should be the same as that
of $X_t$, i.e., $\pi$. To verify this, we check that $\pi P = \pi$, i.e., that

$$\pi \left(I + \frac{1}{\lambda} Q\right) = \pi.$$

That identity holds since $\pi Q = 0$. Thus, the DTMC $Z_n$ has the same invariant
distribution as $X_t$. Observe that $Z_n$ is not the same as the jump chain of $X_t$. Also, it
is not a discrete-time approximation of $X_t$. This DTMC shows that a CTMC can be
seen as a DTMC where one replaces the constant time steps by i.i.d. exponentially
distributed time steps between the jumps.
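The representation $Y_t = Z_{N_t}$ also suggests a way to compute the distribution at time $t$: condition on the Poisson number of jumps and average the powers of $P$. The sketch below assumes that $N_t$ is independent of the chain $Z_n$ and that the dummy jumps leave the distribution of the state unchanged; the function name, the truncation level `n_terms`, and the rate matrix are illustrative assumptions.

```python
import math
import numpy as np

def transient_distribution(pi0, Q, t, n_terms=200):
    """pi_t = sum_n P(N_t = n) * pi0 P^n, with P = I + Q / rate."""
    rate = np.max(-np.diag(Q))             # uniformization rate >= all q_i
    P = np.eye(Q.shape[0]) + Q / rate
    weight = math.exp(-rate * t)           # P(N_t = 0)
    term = np.array(pi0, dtype=float)      # pi0 P^n, starting with n = 0
    pi_t = weight * term
    for n in range(1, n_terms):
        term = term @ P
        weight *= rate * t / n             # P(N_t = n) from P(N_t = n - 1)
        pi_t = pi_t + weight * term
    return pi_t

# Hypothetical rates: q(0,1) = 1, q(0,2) = 2, q(1,2) = 3, q(2,0) = 4.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.0, -3.0, 3.0],
              [4.0, 0.0, -4.0]])
pi_5 = transient_distribution([1.0, 0.0, 0.0], Q, 5.0)
print(pi_5)
```

All the terms of the sum are nonnegative, which is one reason this form is often preferred numerically over summing the power series of $\exp\{Qt\}$ directly. The truncation level must be large compared to the product of the rate and $t$.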
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6.2.5 Time Reversal 

As a preparation for our study of networks of queues, we note the following result. 
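Theorem 6.2 (Kelly's Lemma) Let $Q$ be the rate matrix of a Markov chain on $\mathcal{X}$.
Let also $\tilde{Q}$ be another rate matrix on $\mathcal{X}$. Assume that $\pi$ is a distribution on $\mathcal{X}$ and
that

$$q_i = \tilde{q}_i, \; i \in \mathcal{X} \quad \text{and}$$
$$\pi(i) \tilde{q}(i, j) = \pi(j) q(j, i), \; \forall i \neq j.$$

Then $\pi Q = 0$.

Proof We have

$$\sum_{j \neq i} \pi(j) q(j, i) = \sum_{j \neq i} \pi(i) \tilde{q}(i, j) = \pi(i) \sum_{j \neq i} \tilde{q}(i, j) = \pi(i) \tilde{q}_i = \pi(i) q_i,$$

so that $\pi Q = 0$. □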


The following result explains the meaning of Q in the previous theorem. We state 
it without proof. 


Theorem 6.3 Assume that $X_t$ has the invariant distribution $\pi$. Then $X_t$ reversed in
time is a Markov chain with rate matrix $\tilde{Q}$ given by

$$\tilde{q}(i, j) = \frac{\pi(j) q(j, i)}{\pi(i)}.$$
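The formula for the reversed rates can be evaluated on a small example to confirm that it produces a rate matrix satisfying the conditions of Kelly's Lemma. This is a sketch assuming NumPy and a hypothetical three-state chain whose invariant distribution is $[0.48, 0.16, 0.36]$.

```python
import numpy as np

# Hypothetical rates: q(0,1) = 1, q(0,2) = 2, q(1,2) = 3, q(2,0) = 4.
Q = np.array([[-3.0, 1.0, 2.0],
              [0.0, -3.0, 3.0],
              [4.0, 0.0, -4.0]])
pi = np.array([0.48, 0.16, 0.36])        # satisfies pi Q = 0 for this Q

# Reversed rates: q~(i, j) = pi(j) q(j, i) / pi(i).
Q_rev = (Q.T * pi[None, :]) / pi[:, None]

print(Q_rev)
print(Q_rev.sum(axis=1))                 # rows of a rate matrix sum to zero
print(np.diag(Q_rev), np.diag(Q))        # same total rates out of each state
```

The reversed chain moves along the same edges in the opposite direction, with the same total rate out of each state and the same invariant distribution.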


6.3 Product-Form Networks 


Theorem 6.4 (Invariant Distribution of Network) Assume $\lambda_k < \mu_k$ and let
$\rho_k = \lambda_k/\mu_k$, for $k = 1, 2, 3$. Then the Markov chain $X_t$ has a unique invariant
distribution $\pi$ that is given by

$$\pi(x_1, x_2, x_3) = \pi_1(x_1) \pi_2(x_2) \pi_3(x_3)$$

$$\pi_k(n) = (1 - \rho_k)\rho_k^n, \quad n \geq 0, \; k = 1, 2$$

$$\pi_3(a_1, a_2, \ldots, a_n) = p(a_1) p(a_2) \cdots p(a_n)(1 - \rho_3)\rho_3^n, \quad n \geq 0, \; a_k \in \{1, 2\}, \; k = 1, \ldots, n,$$

where $p(1) = \lambda_1/(\lambda_1 + \lambda_2)$ and $p(2) = \lambda_2/(\lambda_1 + \lambda_2)$.

Fig. 6.10 The network (top) and a guess for its time-reversal (bottom). The
bottom network is obtained from the top one by reversing the flows of customers. It is a
bold guess that the arrivals have exponential inter-arrival times and their rates are
independent of the current queue lengths


Proof Figure 6.10 shows a guess for the time-reversal of the network.
Let $Q$ be the rate matrix of the top network and $\tilde{Q}$ that of the bottom one. Let
also $\pi$ be as stated in the theorem. We show that $\pi$, $Q$, $\tilde{Q}$ satisfy the conditions of
Kelly's Lemma.
For instance, we verify that

$$\pi([3, 2, [1, 1, 2, 1]]) \, q([3, 2, [1, 1, 2, 1]], [4, 2, [1, 1, 2]]) = \pi([4, 2, [1, 1, 2]]) \, \tilde{q}([4, 2, [1, 1, 2]], [3, 2, [1, 1, 2, 1]]).$$

Looking at the figure, we can see that

$$q([3, 2, [1, 1, 2, 1]], [4, 2, [1, 1, 2]]) = \mu_3 p_1$$
$$\tilde{q}([4, 2, [1, 1, 2]], [3, 2, [1, 1, 2, 1]]) = \mu_1 p_1.$$

Thus, the previous identity reads

$$\pi([3, 2, [1, 1, 2, 1]]) \mu_3 p_1 = \pi([4, 2, [1, 1, 2]]) \mu_1 p_1,$$

i.e.,

$$\pi([3, 2, [1, 1, 2, 1]]) \mu_3 = \pi([4, 2, [1, 1, 2]]) \mu_1.$$

Given the expression for $\pi$, this is

$$(1 - \rho_1)\rho_1^3 \times (1 - \rho_2)\rho_2^2 \times p(1)p(1)p(2)p(1)(1 - \rho_3)\rho_3^4 \mu_3 = (1 - \rho_1)\rho_1^4 \times (1 - \rho_2)\rho_2^2 \times p(1)p(1)p(2)(1 - \rho_3)\rho_3^3 \mu_1.$$

After simplifications, this identity is seen to be equivalent to

$$p(1)\rho_3 \mu_3 = \rho_1 \mu_1,$$

i.e.,

$$\frac{\lambda_1}{\lambda_3} \frac{\lambda_3}{\mu_3} \mu_3 = \frac{\lambda_1}{\mu_1} \mu_1,$$

and this equation is seen to be satisfied. A similar argument shows that Kelly's
lemma is satisfied for all pairs of states. □


6.4 Proof of Theorem 5.7 


The first step in using the theorem is to solve the flow conservation equations. Let
us call class 1 that of the white jobs and class 2 that of the gray job. Then we see
that

$$\lambda_1^1 = \lambda_2^1 = \gamma, \quad \lambda_1^2 = \lambda_2^2 = \alpha$$

solve the flow conservation equations for any $\alpha > 0$. We have to assume $\gamma < \mu$ for
the services to be able to keep up with the white jobs. With this assumption, we can
choose $\alpha$ small enough so that $\lambda_1 = \lambda_2 = \lambda := \gamma + \alpha < \min\{\mu_1, \mu_2\}$.

The second step is to use the theorem to obtain the invariant distribution. It is

$$\pi(x_1, x_2) = A\,h(x_1)h(x_2)$$

with

$$h(x_i) = \left(\frac{\gamma}{\mu}\right)^{n_1(x_i)} \left(\frac{\alpha}{\mu}\right)^{n_2(x_i)} = \rho_1^{n_1(x_i)} \rho_2^{n_2(x_i)},$$

where $\rho_1 = \gamma/\mu$, $\rho_2 = \alpha/\mu$, and $n_c(x_i)$ is the number of jobs of class $c$ in $x_i$, for
$c = 1, 2$. To calculate $A$, we note that there are $n + 1$ states $x_i$ with $n$ class 1 jobs
and 1 class 2 job, and 1 state $x_i$ with $n$ class 1 jobs and no class 2 job. Indeed, the
class 2 customer can be in $n + 1$ positions in the queue with the $n$ customers of class
1.


Also, all the possible pairs $(x_1, x_2)$ must have one class 2 customer either in
queue 1 or in queue 2. Thus,

$$1 = \sum_{(x_1, x_2)} \pi(x_1, x_2) = A \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} G(m, n),$$

where

$$G(m, n) = (m + 1)\rho_1^{m+n}\rho_2 + (n + 1)\rho_1^{m+n}\rho_2.$$


In this expression, the first term corresponds to the states with $m$ class 1 customers
and one class 2 customer in queue 1 and $n$ customers of class 1 in queue 2; the
second term corresponds to the states with $m$ customers of class 1 in queue 1, and $n$
customers of class 1 and one customer of class 2 in queue 2. Thus, $A\,G(m, n)$ is the
probability that there are $m$ customers of class 1 in the first queue and $n$ customers
of class 1 in the second queue.

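Hence,

$$1 = A \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} \left[(m + 1)\rho_1^{m+n}\rho_2 + (n + 1)\rho_1^{m+n}\rho_2\right]
= 2A \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} (m + 1)\rho_1^{m+n}\rho_2,$$

by symmetry of the two terms. Thus,

$$1 = 2A\rho_2 \sum_{m=0}^{\infty} (m + 1)\rho_1^m \left[\sum_{n=0}^{\infty} \rho_1^n\right].$$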
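To compute the sum, we use the following identities:

$$\sum_{n=0}^{\infty} \rho^n = (1 - \rho)^{-1}, \quad \text{for } 0 < \rho < 1$$

and

$$\sum_{n=0}^{\infty} (n + 1)\rho^n = \frac{\partial}{\partial\rho} \sum_{n=0}^{\infty} \rho^{n+1} = \frac{\partial}{\partial\rho}\left[(1 - \rho)^{-1} - 1\right] = (1 - \rho)^{-2}.$$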
n=0 


Thus, one has

$$1 = 2A\rho_2(1 - \rho_1)^{-3},$$

so that

$$A = \frac{(1 - \rho_1)^3}{2\rho_2}.$$
Third, we calculate the expected number $L$ of jobs of class 1 in the two queues.
One has

$$L = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A(m + n)G(m, n)$$

$$= \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A(m + n)(m + 1)\rho_1^{m+n}\rho_2 + \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A(m + n)(n + 1)\rho_1^{m+n}\rho_2$$

$$= 2\sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A(m + n)(m + 1)\rho_1^{m+n}\rho_2,$$



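where the last identity follows from the symmetry of the two terms. Thus,

$$L = 2\sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A\,m(m + 1)\rho_1^{m+n}\rho_2 + 2\sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A\,n(m + 1)\rho_1^{m+n}\rho_2$$

$$= 2A\rho_2 \sum_{m=0}^{\infty} m(m + 1)\rho_1^m \sum_{n=0}^{\infty} \rho_1^n + 2A\rho_2 \sum_{m=0}^{\infty} (m + 1)\rho_1^m \sum_{n=0}^{\infty} n\rho_1^n$$

$$= 2A\rho_2 \sum_{m=0}^{\infty} m(m + 1)\rho_1^m (1 - \rho_1)^{-1} + 2A\rho_2(1 - \rho_1)^{-2} \sum_{n=0}^{\infty} n\rho_1^n.$$

To calculate the sums, we use the fact that

$$\sum_{m=0}^{\infty} m(m + 1)\rho^m = \rho \sum_{m=0}^{\infty} m(m + 1)\rho^{m-1}
= \rho \frac{\partial^2}{\partial\rho^2} \sum_{m=0}^{\infty} \rho^{m+1} = \rho \frac{\partial^2}{\partial\rho^2}\left[(1 - \rho)^{-1} - 1\right]
= 2\rho(1 - \rho)^{-3}.$$

Also,

$$\sum_{n=0}^{\infty} n\rho_1^n = \rho_1 \sum_{n=0}^{\infty} n\rho_1^{n-1} = \rho_1 \sum_{n=0}^{\infty} (n + 1)\rho_1^n = \rho_1(1 - \rho_1)^{-2}.$$

Hence,

$$L = 2A\rho_2 \times 2\rho_1(1 - \rho_1)^{-3} \times (1 - \rho_1)^{-1} + 2A\rho_2(1 - \rho_1)^{-2} \times \rho_1(1 - \rho_1)^{-2}
= 6A\rho_2\rho_1(1 - \rho_1)^{-4}.$$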
Substituting the value for $A$ that we derived above, we find

$$L = 3\,\frac{\rho_1}{1 - \rho_1}.$$

Finally, we get the average time $W$ that jobs of class 1 spend in the network:
$W = L/\gamma$.
Without the gray job, the expected delay $W'$ of the white jobs would be the sum
of delays in two M/M/1 queues, i.e., $W' = L'/\gamma$ where

$$L' = 2\,\frac{\rho_1}{1 - \rho_1}.$$

Hence, we find that

$$W = 1.5\,W',$$


so that using a hello message increases the average delay of the class 1 customers 
by 50%. 
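
The closed forms above can be checked numerically. The following Python sketch is not from the book; it truncates the double sums at a hypothetical cutoff `terms` and uses illustrative values $\rho_1 = 0.5$ and $\rho_2 = 0.01$ (any $\rho_2 > 0$ should give the same answer, since $\rho_2$ cancels). It evaluates $\sum A\,G(m, n)$ and $\sum A(m + n)G(m, n)$ directly.

```python
def class1_stats(rho1, rho2, terms=300):
    """Return (total probability, expected number L of class 1 jobs).

    Both quantities are computed by truncating the double sums of the text
    after `terms` values of m and of n (an assumption of this sketch).
    """
    A = (1 - rho1) ** 3 / (2 * rho2)  # normalizing constant derived in the text
    total = 0.0
    mean = 0.0
    for m in range(terms):
        for n in range(terms):
            # G(m, n) = (m + 1) rho1^(m+n) rho2 + (n + 1) rho1^(m+n) rho2
            g = (m + 1) * rho1 ** (m + n) * rho2 + (n + 1) * rho1 ** (m + n) * rho2
            total += A * g
            mean += A * (m + n) * g
    return total, mean


rho1, rho2 = 0.5, 0.01              # illustrative values, not from the book
total, L = class1_stats(rho1, rho2)
L_without = 2 * rho1 / (1 - rho1)   # two M/M/1 queues without the gray job
print(total, L, L / L_without)
```

With these values the probabilities should sum to 1 and the ratio $L/L'$ should be close to 1.5, in agreement with $W = 1.5\,W'$.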


6.5 References 


The time-reversal arguments are developed in Kelly (1979). That book also explains 
many other models that can be analyzed using that approach. See also Bremaud 
(2008), Lyons and Perez (2017), Neely (2010). 
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Application: Transmitting bits across a physical medium 
Topics: MAP, MLE, Hypothesis Testing 


7.1 Digital Link 


A digital link consists of a transmitter and a receiver. It transmits bits over some 
physical medium that can be a cable, a phone line, a laser beam, an optical fiber, an 
electromagnetic wave, or even a sound wave. This contrasts with an analog system 
that transmits signals without converting them into bits, as in Fig. 7.1. 

An elementary such system¹ consists of a phone line and, to send a bit 0, the
transmitter applies a voltage −1 Volt across its end of the line for T seconds; to
send a bit 1, it applies the voltage +1 Volt for T seconds. The receiver measures
the voltage across its end of the line. If the voltage that the receiver measures is
negative, it decides that the transmitter must have sent a 0; if it is positive, it decides
that the transmitter sent a 1. This system is not error-free. The receiver gets a noisy
and attenuated version of what the transmitter sent. Thus, there is a chance that a 0
is mistaken for a 1, and vice versa. Various coding techniques are used to reduce the
chances of such errors. Figure 7.2 shows the general structure of a digital link.

In this chapter, we explore the operating principles of digital links and their
characteristics. We start with a discussion of Bayes' rule and of detection theory.
We apply these ideas to a simple model of communication link. We then explore a
coding scheme that makes the transmissions faster. We conclude the chapter with a
discussion of modulation and detection schemes that actual transmission systems,
such as ADSL and Cable Modems, use.


¹ We are ignoring many details of synchronization.


© The Author(s) 2021
J. Walrand, Probability in Electrical Engineering and Computer Science,
https://doi.org/10.1007/978-3-030-49995-2_7


Fig. 7.1 An analog communication system

Fig. 7.2 Components of a digital link


7.2 Detection and Bayes' Rule 

The receiver gets some signal S and tries to guess what the transmitter sent. We
explore a general model of this problem and we then apply it to concrete situations.


7.2.1 Bayes' Rule


The basic formulation is that there are $N$ possible exclusive circumstances
$C_1, \ldots, C_N$ under which a particular symptom $S$ can occur. By exclusive, we
mean that exactly one circumstance occurs (Fig. 7.3). Each circumstance $C_i$ has
some prior probability $p_i$ and $q_i$ is the probability that $S$ occurs under circumstance
$C_i$. Thus,

$$p_i = P(C_i) \text{ and } q_i = P[S \mid C_i], \text{ for } i = 1, \ldots, N,$$
where

$$p_i \ge 0, \quad q_i \in [0, 1] \text{ for } i = 1, \ldots, N \quad \text{and} \quad \sum_{i=1}^{N} p_i = 1.$$


Fig. 7.3 The symptom and its possible circumstances. Here, $p_i = P(C_i)$ and
$q_i = P[S \mid C_i]$


The posterior probability $\pi_i$ that circumstance $C_i$ is in effect given that $S$ is
observed can be computed by using Bayes' rule as we explain next. One has

$$\pi_i = P[C_i \mid S] = \frac{P(C_i \text{ and } S)}{P(S)}
= \frac{P(C_i \text{ and } S)}{\sum_{j=1}^{N} P(C_j \text{ and } S)}
= \frac{P[S \mid C_i]P(C_i)}{\sum_{j=1}^{N} P[S \mid C_j]P(C_j)}
= \frac{p_i q_i}{\sum_{j=1}^{N} p_j q_j}.$$


Given the importance of this result, we state it as a theorem.


Theorem 7.1 (Bayes' Rule) One has

$$\pi_i = \frac{p_i q_i}{\sum_{j=1}^{N} p_j q_j}, \quad i = 1, \ldots, N. \qquad (7.1)$$


This rule is very simple but is a canonical example of how observations affect 
our beliefs. It is due to Thomas Bayes (Fig. 7.4). 
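
As a small illustration of (7.1), here is a minimal Python sketch. The function name `posterior` and the three priors and conditional probabilities are hypothetical values chosen for this example; they do not come from the book.

```python
def posterior(priors, likelihoods):
    """Posterior probabilities pi_i = p_i q_i / sum_j p_j q_j, as in (7.1)."""
    joint = [p * q for p, q in zip(priors, likelihoods)]  # P(C_i and S)
    total = sum(joint)                                    # P(S)
    return [j / total for j in joint]


priors = [0.5, 0.3, 0.2]        # hypothetical p_i, summing to 1
likelihoods = [0.1, 0.4, 0.8]   # hypothetical q_i = P[S | C_i]
post = posterior(priors, likelihoods)
print(post)
```

In this made-up example, the circumstance with the smallest prior ends up with the largest posterior, because the symptom is much more likely under it.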
Fig. 7.4 Thomas Bayes, 1701-1761


7.2.2 Circumstances vs. Causes 


In the previous section we were careful to qualify the C; as possible circumstances, 
not as causes. The distinction is important. Say that you go to a beach, eat an ice 
cream, and leave with a sunburn. Later, you meet a friend who did not go to the 
beach, did not eat an ice cream, and did not get sunburned. More generally, the 
probability that someone got sunburned is larger if that person ate an ice cream. 
However, it would be silly to qualify the ice cream as the cause of the sunburn. 
Unfortunately, confusing correlation and causation is a prevalent mistake. 


7.2.3 MAP and MLE 


Given the previous model, we see that the most likely circumstance under which the 
symptom occurs, which we call the Maximum A Posteriori (MAP) estimate of the 
circumstance given the symptom, is 


$$MAP = \arg\max_i \pi_i = \arg\max_i p_i q_i.$$


The notation is that if $h(\cdot)$ is a function, then $\arg\max_x h(x)$ is any value of $x$ that
achieves the maximum of $h(\cdot)$. Thus, if $x^* = \arg\max_x h(x)$, then $h(x^*) \ge h(x)$ for
all $x$.

Thus, the MAP is the most likely circumstance, a posteriori, that is, after having 
observed the symptom. 

Note that if all the prior probabilities are equal, i.e., if $p_i = 1/N$ for all $i$, then
the MAP maximizes $q_i$. In general, the estimate that maximizes $q_i$ is called the
Maximum Likelihood Estimate (MLE) of the circumstance given the symptom. That
is,

$$MLE = \arg\max_i q_i.$$


In other words, the MLE is the circumstance that makes the symptom most likely.
More generally, one has the following definitions. 
Definition 7.1 (MAP and MLE) Let $(X, Y)$ be discrete random variables. Then

$$MAP[X \mid Y = y] = \arg\max_x P(X = x \text{ and } Y = y)$$

and

$$MLE[X \mid Y = y] = \arg\max_x P[Y = y \mid X = x].$$

□


These definitions extend in the natural way to the continuous case, as we will get 
to see later. 


Example: Ice Cream and Sunburn 

As an example, say that on a particular summer day in Berkeley 500 out of 100,000 
people eat ice cream, among which 50 get sunburned and that among the 99,500 
who do not eat ice cream, 600 get sunburned. Then, the MAP of eating ice cream 
given sunburn is No but the MLE is Yes. Indeed, we see that 


$$P(\text{sunburn and ice cream}) = \frac{50}{100{,}000} \ll P(\text{sunburn and no ice cream}) = \frac{600}{100{,}000},$$


so that among those who have a sunburn, a minority eat ice cream, and it is more
likely that a sunburned person did not eat ice cream. Hence, the MAP is No. However,
the fraction of people who have a sunburn is larger among those who eat ice cream
(10%) than among those who do not (0.6%). Hence, the MLE is Yes.
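
The two estimates can be computed directly from the counts of the example. The sketch below is only an illustration; the dictionary layout and the helper names `joint` and `likelihood` are choices made here, and the counts 450 and 98,900 are obtained by subtraction from the numbers in the text.

```python
POPULATION = 100_000
# (ate ice cream, got sunburned) -> number of people, from the example
counts = {
    ("yes", True): 50, ("yes", False): 450,      # 500 ice cream eaters
    ("no", True): 600, ("no", False): 98_900,    # 99,500 others
}


def joint(x):
    """P(X = x and sunburn): the quantity maximized by the MAP."""
    return counts[(x, True)] / POPULATION


def likelihood(x):
    """P[sunburn | X = x]: the quantity maximized by the MLE."""
    return counts[(x, True)] / (counts[(x, True)] + counts[(x, False)])


map_estimate = max(["yes", "no"], key=joint)
mle_estimate = max(["yes", "no"], key=likelihood)
print(map_estimate, mle_estimate)
```

The MAP weighs the likelihood by the prior (few people eat ice cream), whereas the MLE looks only at how likely the sunburn is under each circumstance.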


7.2.4 Binary Symmetric Channel 


We apply the concepts of MLE and MAP to a simplified model of a communication 
link. Figure 7.5 illustrates the model, called a binary symmetric channel (BSC). 

In this model, the transmitter sends a 0 or a 1 and the receiver gets the transmitted
bit with probability $1 - p$, otherwise it gets the opposite bit. Thus, the channel makes
an error with probability $p$. We assume that if the transmitter sends successive bits,
the errors are i.i.d.


Fig. 7.5 The binary symmetric channel

Fig. 7.6 MAP for BSC. Here, $\alpha = P(X = 1)$ and $p$ is the probability of a channel
error


Note that if $p = 0$ or $p = 1$, then one can recover exactly every bit that is
sent. Also, if $p = 0.5$, then the output is independent of the input and no useful
information goes through the channel. What happens in the other cases?

Call $X \in \{0, 1\}$ the input of the channel and $Y \in \{0, 1\}$ its output. Assume that
you observe $Y = 1$ and that $P(X = 1) = \alpha$, so that $P(X = 0) = 1 - \alpha$. We have
the following result illustrated in Fig. 7.6.


Theorem 7.2 (MAP and MLE for BSC) For the BSC with $p < 0.5$,

$$MAP[X \mid Y = 0] = 1\{\alpha > 1 - p\}, \quad MAP[X \mid Y = 1] = 1\{\alpha > p\}$$

and

$$MLE[X \mid Y] = Y.$$

□

To understand the MAP results, consider the case $Y = 1$. Since $p < 0.5$, we are
inclined to think that $X = 1$. However, if $\alpha$ is small, this is unlikely. The result is
that $X = 1$ is more likely than $X = 0$ if $\alpha > p$, i.e., if the prior is "stronger" than
the noise. The case $Y = 0$ is similar.


Proof In the terminology of Bayes' rule, the event $Y = 1$ is the symptom. Also, the
prior probabilities are

$$p_0 = 1 - \alpha \text{ and } p_1 = \alpha,$$

and the conditional probabilities are

$$q_0 = P[Y = 1 \mid X = 0] = p \text{ and } q_1 = P[Y = 1 \mid X = 1] = 1 - p.$$

Hence,

$$MAP[X \mid Y = 1] = \arg\max_{i \in \{0, 1\}} p_i q_i.$$

Thus,

$$MAP[X \mid Y = 1] = \begin{cases} 1, & \text{if } p_1 q_1 = \alpha(1 - p) > p_0 q_0 = (1 - \alpha)p \\ 0, & \text{otherwise.} \end{cases}$$

Hence, $MAP[X \mid Y = 1] = 1\{\alpha > p\}$. That is, when $Y = 1$, your guess is that
$X = 1$ if the prior that $X = 1$ is larger than the probability that the channel makes
an error.

Also,

$$MLE[X \mid Y = 1] = \arg\max_{i \in \{0, 1\}} q_i.$$

In this case, since $p < 0.5$, we see that $MLE[X \mid Y = 1] = 1$, because $Y = 1$
is more likely when $X = 1$ than when $X = 0$. Thus, the MLE ignores the prior
and always guesses that $X = 1$ when $Y = 1$, even though the prior probability
$P(X = 1) = \alpha$ may be very small.

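Similarly, we see that

$$MAP[X \mid Y = 0] = \arg\max_{i \in \{0, 1\}} p_i(1 - q_i).$$

Thus,

$$MAP[X \mid Y = 0] = \begin{cases} 1, & \text{if } p_1(1 - q_1) = \alpha p > p_0(1 - q_0) = (1 - \alpha)(1 - p) \\ 0, & \text{otherwise.} \end{cases}$$

Hence, $MAP[X \mid Y = 0] = 1\{\alpha > 1 - p\}$. Thus, when $Y = 0$, you guess that $X = 1$
if $X = 1$ is more likely a priori than the channel being correct.
Also, $MLE[X \mid Y = 0] = 0$ because $p < 0.5$, irrespective of $\alpha$. □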
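
Theorem 7.2 can be checked by brute force. The following Python sketch computes the MAP and MLE directly from their definitions and compares them with the closed forms of the theorem. The function names and the grid of values of $\alpha$ and $p$ are choices made for this illustration.

```python
def bsc_map(y, alpha, p):
    """MAP[X | Y = y] for the BSC: the x that maximizes P(X = x) P[Y = y | X = x]."""
    prior = {0: 1 - alpha, 1: alpha}

    def likelihood(x):
        return 1 - p if x == y else p

    # max() returns the first maximizer, so ties are resolved in favor of 0,
    # which matches the strict inequalities in the theorem.
    return max((0, 1), key=lambda x: prior[x] * likelihood(x))


def bsc_mle(y, p):
    """MLE[X | Y = y] for the BSC: the x that maximizes P[Y = y | X = x]."""
    return max((0, 1), key=lambda x: 1 - p if x == y else p)


alphas = [0.05, 0.2, 0.5, 0.8, 0.95]   # illustrative priors P(X = 1)
error_rates = [0.1, 0.3, 0.4]          # illustrative channel error rates, all below 0.5
for a in alphas:
    for p in error_rates:
        print(a, p, bsc_map(0, a, p), bsc_map(1, a, p), bsc_mle(0, p), bsc_mle(1, p))
```

For instance, with $\alpha = 0.05$ and $p = 0.1$, the MAP guesses $X = 0$ even when $Y = 1$, whereas the MLE guesses $X = 1$.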


7.3 Huffman Codes 


Coding can improve the characteristics of a digital link. We explore Huffman codes 
in this section. 

Say that you want to transmit strings of symbols A, B, C, D across a digital link. 
The simplest method is to encode these symbols as 00, 01, 10, and 11, respectively. 
In so doing, each symbol requires transmitting two bits. Assuming that there is no 
error, if the receiver gets the bits 0100110001, it recovers the string BADAB. 
Fig. 7.7 David Huffman, 1925-1999


Now assume that the strings are such that the symbols occur with the following
frequencies: (A, 55%), (B, 30%), (C, 10%), (D, 5%). Thus, A occurs 55% of the
time, and similarly for the other symbols. In this situation, one may design a code
where A requires fewer bits than D.

The Huffman code (Huffman 1952, Fig. 7.7) for this example is as follows:

A → 0, B → 10, C → 110, D → 111.

The average number of bits required per symbol is

1 × 55% + 2 × 30% + 3 × 10% + 3 × 5% = 1.6.


Thus, one saves 20% of the transmissions and the resulting system is 25% faster 
(ah! arithmetics). Note that the code is such that, when there is no error, the receiver 
can recover the symbols uniquely from the bits it gets. For instance, if the receiver 
gets 110100111, the symbols are CBAD, without ambiguity. 

The reason why there is no possible ambiguity is that one can picture the bits as 
indicating the path in a tree that ends with a leaf of the tree, as shown in Fig. 7.8. 
Thus, starting with the first bit received, one walks down the tree until one reaches 
a leaf. One then repeats for the subsequent bits. In our example, when the bits are 
110100111, one starts at the top of the tree, then one follows the branches 110 and 
reaches leaf C, then one restarts from the top and follows the branches 10 and gets 
to the leaf B, and so on. Codes that have this property of being uniquely decodable 
in one pass are called prefix-free codes. 

The construction of the code is simple. As shown in Fig. 7.8, one joins the two
symbols with the smallest frequency of occurrence, here C and D, with branches 0
and 1 and assigns the group CD the sum of the symbol frequencies, here 0.15. One
then continues in the same way, joining CD and B and assigning the group BCD
the frequency 0.3 + 0.15 = 0.45. Finally, one joins A and BCD. The resulting tree
specifies the code.
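
The construction can be sketched in a few lines of Python. This is one possible implementation, not the book's: it keeps the groups in a heap and repeatedly joins the two groups with the smallest frequencies. Which branch is labeled 0 and which is labeled 1 is arbitrary, so the codewords may differ from those of Fig. 7.8, but the codeword lengths are the same. The helpers `encode` and `decode` are also illustrative; `decode` assumes an error-free, prefix-free bit string.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a prefix-free code by repeatedly joining the two least frequent groups."""
    tiebreak = count()  # unique counter so that the heap never compares dictionaries
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, group0 = heapq.heappop(heap)   # smallest frequency
        f1, _, group1 = heapq.heappop(heap)   # second smallest frequency
        merged = {s: "0" + w for s, w in group0.items()}
        merged.update({s: "1" + w for s, w in group1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Walk through the bits and emit a symbol each time a codeword is completed."""
    inverse = {w: s for s, w in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)


freqs = {"A": 0.55, "B": 0.30, "C": 0.10, "D": 0.05}
code = huffman_code(freqs)
average_length = sum(freqs[s] * len(code[s]) for s in freqs)
print(code, average_length)
print(decode("110100111", {"A": "0", "B": "10", "C": "110", "D": "111"}))
```

The last line decodes the bit string of the text with the code of Fig. 7.8 and should recover CBAD.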

The following property is worth noting. 
Fig. 7.8 Huffman code

Symbol      A      B      C      D
Frequency   0.55   0.3    0.1    0.05
Codeword    0      10     110    111


Theorem 7.3 (Optimality of Huffman Code) The Huffman code has the smallest
average number of bits per symbol among all prefix-free codes.

□

Proof See Chap. 8. □


It should be noted that other codes have a smaller average length, but they are
not symbol-by-symbol codes and are more complex. One code is based on the
observation that there are only $2^{nH}$ likely strings of $n \gg 1$ symbols, where

$$H = -\sum_{X} x \log_2(x).$$


In this expression, $x$ is the frequency of symbol $X$ and the sum is over all the
symbols. This expression $H$ is the entropy of the distribution of the symbols. Thus,
by listing all these strings and assigning $nH$ bits to identify them, one requires only
$nH$ bits for $n$ symbols, or $H$ bits per symbol (see Sect. 15.7).

In our example, one has

$$H = -0.55\log_2(0.55) - 0.3\log_2(0.3) - 0.1\log_2(0.1) - 0.05\log_2(0.05) = 1.54.$$


Thus, for this example, the savings over the Huffman code are not spectacular, but
it is easy to find examples for which they are. For instance, assume that there are
only two symbols A and B with frequencies $p$ and $1 - p$, for some $p \in (0, 1)$. The
Huffman code requires one bit per symbol, but codes based on long strings require
only $-p\log_2(p) - (1 - p)\log_2(1 - p)$ bits per symbol. For $p = 0.1$, this is 0.47,
which is less than half the number of bits of the Huffman code.

Codes based on long strings of symbols are discussed in Sect. 15.7.
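
The two entropy values quoted above are easy to reproduce. The function `entropy` below is a short illustrative helper written for this check, not code from the book.

```python
from math import log2


def entropy(freqs):
    """Entropy H = -sum x log2(x), in bits per symbol, of a list of frequencies."""
    return -sum(x * log2(x) for x in freqs if x > 0)


H_example = entropy([0.55, 0.30, 0.10, 0.05])   # the four-symbol example
H_binary = entropy([0.1, 0.9])                  # two symbols with p = 0.1
print(round(H_example, 2), round(H_binary, 2))
```

The first value is to be compared with the 1.6 bits per symbol of the Huffman code, and the second with the one bit per symbol that a symbol-by-symbol code needs for two symbols.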
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7.4 Gaussian Channel 


In the previous sections, we had a simplified model of a channel as a BSC. In this
section, we examine a more realistic model of the channel that captures the physical
characteristic of the noise. In this model, the transmitter sends a bit $X \in \{0, 1\}$ and
the receiver gets $Y$ where

$$Y = X + Z.$$


In this identity, $Z =_D \mathcal{N}(0, \sigma^2)$ and is independent of $X$. We say that this is an
additive Gaussian noise channel.

Figure 7.9 shows the densities of $Y$ when $X = 0$ and when $X = 1$. Indeed, when
$X = x$, we see that $Y =_D \mathcal{N}(x, \sigma^2)$.

Assume that the receiver observes $Y$. How should it decide whether $X = 0$ or
$X = 1$? Assume again that $P(X = 1) = p_1 = \alpha$ and $P(X = 0) = p_0 = 1 - \alpha$.

In this example, $P[Y = y \mid X = 0] = 0$ for all values of $y$. Indeed, $Y$ is a
continuous random variable. So, we must change our discussion of Bayes' rule
a little. Here is how to do it. Pretend that we do not measure $Y$ with infinite precision
but that we instead observe that $Y \in (y, y + \epsilon)$ where $0 < \epsilon \ll 1$. Thus, the
symptom is $Y \in (y, y + \epsilon)$ and it now has a positive probability. In fact,

$$q_0 = P[Y \in (y, y + \epsilon) \mid X = 0] \approx f_0(y)\epsilon,$$

by definition of the density $f_0(y)$ of $Y$ when $X = 0$. Similarly,

$$q_1 = P[Y \in (y, y + \epsilon) \mid X = 1] \approx f_1(y)\epsilon.$$

Hence,

$$MAP[X \mid Y \in (y, y + \epsilon)] = \arg\max_{i \in \{0, 1\}} p_i f_i(y)\epsilon.$$

Since the result does not depend on $\epsilon$, we write

$$MAP[X \mid Y = y] = \arg\max_{i \in \{0, 1\}} p_i f_i(y).$$

Similarly,

$$MLE[X \mid Y = y] = \arg\max_{i \in \{0, 1\}} f_i(y).$$


Fig. 7.9 The pdf of $Y$ is $f_0$ when $X = 0$ and $f_1$ when $X = 1$

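We can verify that

$$MAP[X \mid Y = y] = 1\left\{y \ge \frac{1}{2} + \sigma^2 \log\left(\frac{p_0}{p_1}\right)\right\}. \qquad (7.2)$$

Also, the resulting probability of error is

$$P\left(\mathcal{N}(0, \sigma^2) \ge \frac{1}{2} + \sigma^2 \log\left(\frac{p_0}{p_1}\right)\right) p_0
+ P\left(\mathcal{N}(1, \sigma^2) < \frac{1}{2} + \sigma^2 \log\left(\frac{p_0}{p_1}\right)\right) p_1.$$

Also,

$$MLE[X \mid Y = y] = 1\{y \ge 0.5\}.$$


If we choose the MLE detection rule, the system has the same probability of error
as a BSC with

$$p = p(\sigma^2) := P(\mathcal{N}(0, \sigma^2) > 0.5) = P\left(\mathcal{N}(0, 1) > \frac{1}{2\sigma}\right).$$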
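
The threshold (7.2) and the probability of error can be evaluated with the standard library class `statistics.NormalDist`. The sketch below is an illustration under the model $Y = X + Z$; the function names and the values $\alpha = 0.1$ and $\sigma = 1$ are choices made here, not the book's.

```python
from math import log
from statistics import NormalDist


def map_threshold(alpha, sigma):
    """Threshold of (7.2): decide X = 1 when y >= 1/2 + sigma^2 log(p0 / p1)."""
    p1, p0 = alpha, 1 - alpha
    return 0.5 + sigma ** 2 * log(p0 / p1)


def error_probability(alpha, sigma, threshold):
    """Probability of error of the rule 'decide 1 when y >= threshold'."""
    p1, p0 = alpha, 1 - alpha
    decide_1_given_0 = 1 - NormalDist(0, sigma).cdf(threshold)
    decide_0_given_1 = NormalDist(1, sigma).cdf(threshold)
    return p0 * decide_1_given_0 + p1 * decide_0_given_1


sigma = 1.0
mle_error = error_probability(0.5, sigma, 0.5)   # equals P(N(0, 1) > 1/(2 sigma))
map_error = error_probability(0.1, sigma, map_threshold(0.1, sigma))
mle_error_small_prior = error_probability(0.1, sigma, 0.5)
print(mle_error, map_error, mle_error_small_prior)
```

When the prior is far from 1/2, the MAP threshold moves away from 0.5 and the probability of error is noticeably smaller than with the MLE rule.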


Simulation 
Figure 7.10 shows the simulation results when $\alpha = 0.5$ and $\sigma = 1$. The code is in
the Jupyter notebook for this chapter.
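
For readers who do not have the notebook at hand, the following self-contained sketch gives the flavor of such a simulation. It is not the notebook's code: the function `simulate_ber`, the seed, and the number of bits are assumptions made for this illustration. It sends random bits through the channel $Y = X + Z$, applies the MLE rule with threshold 0.5, and reports the fraction of incorrectly received bits.

```python
import random


def simulate_ber(num_bits, alpha, sigma, threshold=0.5, seed=0):
    """Fraction of bits received incorrectly over an additive Gaussian noise channel."""
    rng = random.Random(seed)                    # fixed seed for reproducibility
    errors = 0
    for _ in range(num_bits):
        x = 1 if rng.random() < alpha else 0     # transmitted bit, P(X = 1) = alpha
        y = x + rng.gauss(0.0, sigma)            # received value Y = X + Z
        x_hat = 1 if y >= threshold else 0       # detection rule
        errors += (x_hat != x)
    return errors / num_bits


ber = simulate_ber(num_bits=100_000, alpha=0.5, sigma=1.0)
print(ber)
```

With $\sigma = 1$, the observed fraction should be close to $P(\mathcal{N}(0, 1) > 0.5) \approx 0.31$.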


7.4.1 BPSK 


The system in the previous section was very simple and corresponds to a practical
transmission scheme called Binary Phase Shift Keying (BPSK). In this system,
instead of sending a constant voltage for $T$ seconds to represent either a bit 0 or
a bit 1, the transmitter sends a sine wave for $T$ seconds and the phase of that sine
wave depends on whether the transmitter sends a 0 or a 1 (Fig. 7.11).

Specifically, to send bit 0, the transmitter sends the signal

$$s_0 = \{s_0(t) = A\sin(2\pi f t),\ t \in [0, T]\}.$$


Here, $T$ is a multiple of the period, so that $fT = k$ for some integer $k$. To send
a bit 1, the transmitter sends the signal $s_1 = -s_0$. Why all this complication? The
signal is a sine wave around frequency $f$ and the designer can choose a frequency
that the transmission medium transports well. For instance, if the transmission is
wireless, the frequency $f$ is chosen so that the antennas radiate and receive that
frequency well. The wavelength of the transmitted electromagnetic wave is the
speed of light divided by $f$ and it should be of the same order as the physical length
of the antenna. For instance, 1 GHz corresponds to a wavelength of one foot and it
can be transmitted and received by suitably shaped cell phone antennas.


Fig. 7.10 Simulation of AGN channel (plot legend: transmitted bits $X_n$, received
bits $R_n$, fraction of incorrectly received bits, expected BER for $\sigma = 1.0$)

Fig. 7.11 The signal that the transmitter sends when using BPSK

In any case, the transmitter sends the signal $s_i$ to send a bit $i$, for $i = 0, 1$. The
receiver attempts to detect whether $s_0$ or $s_1 = -s_0$ was sent. To do this, it multiplies
the received signal by a sine wave at the frequency $f$, then computes the average
value of the product. That is, if the receiver gets the signal $r = \{r_t, 0 \le t \le T\}$, it
computes

$$\frac{1}{T}\int_0^T r_t \sin(2\pi f t)\,dt.$$


You can verify that if $r = s_0$, then the result is $A/2$ and if $r = s_1$, then the result is
$-A/2$. Thus, the receiver guesses that bit 0 was transmitted if this average value is
positive and that bit 1 was transmitted otherwise.
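
The values $\pm A/2$ can be confirmed numerically. The sketch below approximates the integral by a Riemann sum; the amplitude, the frequency, the duration, and the number of samples are illustrative values chosen here so that $fT$ is an integer.

```python
from math import pi, sin


def correlate(signal, f, T, samples=10_000):
    """Approximate (1/T) * integral over [0, T] of signal(t) sin(2 pi f t) dt."""
    dt = T / samples
    return sum(signal(i * dt) * sin(2 * pi * f * i * dt) for i in range(samples)) * dt / T


A, f, T = 2.0, 5.0, 1.0   # illustrative values with f * T = 5, an integer


def s0(t):
    return A * sin(2 * pi * f * t)   # signal for bit 0


def s1(t):
    return -s0(t)                    # signal for bit 1


print(correlate(s0, f, T), correlate(s1, f, T))
```

The identity behind this result is $\sin^2(\theta) = (1 - \cos(2\theta))/2$: the cosine term averages to zero over an integer number of periods, which leaves $A/2$.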

The signal that the receiver gets is not $s_i$ when the transmitter sends $s_i$. Instead,
the receiver gets an attenuated and noisy version of that signal. As a result, after
doing its calculation, the receiver gets $B + Z$ or $-B + Z$ where $B$ is some constant
that depends on the attenuation, $Z$ is a $\mathcal{N}(0, \sigma^2)$ random variable and $\sigma^2$ reflects
the power of the noise.

Accordingly, the detection problem amounts to detecting the mean value of a 
Gaussian random variable, which is the problem that we discussed earlier. 


7.5 Multidimensional Gaussian Channel 


When using BPSK, the transmitter has a choice between two signals: $s_0$ and $s_1$.
Thus, in $T$ seconds, the transmitter sends one bit. To increase the transmission
rate, communication engineers devised a more efficient scheme called Quadrature
Amplitude Modulation (QAM). When using this scheme, a transmitter can send a
number $k$ of bits every $T$ seconds. The scheme can be designed for different values
of $k$. When $k = 1$, the scheme is identical to BPSK. For $k > 1$, there are $2^k$ different
signals and each one is of the form

$$a\cos(2\pi f t) + b\sin(2\pi f t),$$


where the coefficients $(a, b)$ characterize the signal and correspond to a given string
of $k$ bits. These coefficients form a constellation as shown in Fig. 7.12 in the case
of QAM-16, which corresponds to $k = 4$.

When the receiver gets the signal, it multiplies it by $2\cos(2\pi f t)$ and computes
the average over $T$ seconds. This average value should be the coefficient $a$ if
there were no attenuation and no noise. The receiver also multiplies the signal by
$2\sin(2\pi f t)$ and computes the average over $T$ seconds. The result should be the
coefficient $b$. From the value of $(a, b)$, the receiver can tell the four bits that the
transmitter sent.

Because of the noise (we can correct for the attenuation), the receiver gets a pair
of values $Y = (Y_1, Y_2)$, as shown in the figure. The receiver essentially finds the
constellation point closest to the measured point $Y$ and reads off the corresponding
bits.

Fig. 7.12 A QAM-16 constellation
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The values of |a| and |b| are bounded, because of a power constraint on the 
transmitter. Accordingly, a constellation with more points (i.e., a larger value of k) 
has points that are closer together. This proximity increases the likelihood that the 
noise misleads the receiver. Thus, the size of the constellation should be adapted 
to the power of the noise. This is in fact what actual systems do. For instance, a 
cable modem and an ADSL modem divide the frequency band into small channels 
and they measure the noise power in each channel and choose the appropriate 
constellation for each. WiFi, LTE, and 5G systems use a similar scheme. 


7.5.1 MLE in Multidimensional Case 


We can summarize the effect of modulation, demodulation, amplification to
compensate for the attenuation, and the noise as follows. The transmitter sends one of the
sixteen vectors $\mathbf{x}_k = (a_k, b_k)$ shown in Fig. 7.12. Let us call the transmitted vector
$\mathbf{X}$. The vector that the receiver computes is $\mathbf{Y}$.

Assume first that 


Y=X+Z 


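where $\mathbf{Z} = (Z_1, Z_2)$ and $Z_1, Z_2$ are i.i.d. $\mathcal{N}(0, \sigma^2)$ random variables. That is, we
assume that the errors in $Y_1$ and $Y_2$ are independent and Gaussian. In this case, we
can calculate the conditional density $f_{\mathbf{Y}|\mathbf{X}}[\mathbf{y} \mid \mathbf{x}]$ as follows. Given $\mathbf{X} = \mathbf{x} = (x_1, x_2)$, we see that
$Y_1 = x_1 + Z_1$ and $Y_2 = x_2 + Z_2$. Since $Z_1$ and $Z_2$ are independent, it follows that
$Y_1$ and $Y_2$ are independent as well. Moreover, $Y_1 =_D \mathcal{N}(x_1, \sigma^2)$ and $Y_2 =_D \mathcal{N}(x_2, \sigma^2)$.
Hence,

$$f_{\mathbf{Y}|\mathbf{X}}[\mathbf{y} \mid \mathbf{x}] = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_1 - x_1)^2}{2\sigma^2}\right\} \times \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_2 - x_2)^2}{2\sigma^2}\right\}.$$

Recall that $MLE[\mathbf{X} \mid \mathbf{Y} = \mathbf{y}]$ is the value of $\mathbf{x} \in \{\mathbf{x}_1, \ldots, \mathbf{x}_{16}\}$ that maximizes this
expression. Accordingly, it is the constellation point $\mathbf{x} = (x_1, x_2)$ that minimizes

$$\|\mathbf{x} - \mathbf{y}\|^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2.$$


Thus, $MLE[\mathbf{X} \mid \mathbf{Y}]$ is indeed the constellation point that is the closest to the measured
value $\mathbf{Y}$.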
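
The nearest-point rule is a one-line minimization. The sketch below assumes a square QAM-16 constellation whose coordinates take the values −3, −1, 1, 3; this spacing is a hypothetical choice made for the illustration and is not specified in the text.

```python
from itertools import product

LEVELS = (-3, -1, 1, 3)                        # hypothetical coordinates for a and b
CONSTELLATION = list(product(LEVELS, LEVELS))  # the sixteen points (a_k, b_k)


def mle_point(y):
    """Constellation point closest to the measured pair y = (y1, y2)."""
    return min(CONSTELLATION,
               key=lambda x: (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)


print(mle_point((0.8, -2.6)))   # a noisy measurement near the point (1, -3)
print(mle_point((3.9, 3.2)))    # outside the grid, still decoded as a corner point
```

Minimizing the squared distance is equivalent to maximizing the conditional density above because the density is a decreasing function of that distance.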


7.6 Hypothesis Testing 


There are many situations where the MAP and MLE are not satisfactory guesses. 
This is the case for designing alarms, medical tests, failure detection algorithms, 
and many other applications. We describe an important formulation, called the 
hypothesis testing problem. 
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7.6.1 Formulation 


We consider the case where $X \in \{0, 1\}$ and where one assumes a distribution of $Y$
given $X$. The goal will be to solve the following problem:

$$\text{Maximize } PCD := P[\hat{X} = 1 \mid X = 1]$$

$$\text{subject to } PFA := P[\hat{X} = 1 \mid X = 0] \le \beta.$$


Here, $PCD$ is the probability of correct detection, i.e., of detecting that $X = 1$
when it is actually equal to 1. Also, $PFA$ is the probability of false alarm, i.e., of
declaring that $X = 1$ when it is in fact equal to zero. The constant $\beta$ is a given bound
on the probability of false alarm.

To make sense of the terminology, think that $X = 1$ means that your house
is on fire. It is not reasonable to assume a prior probability that $X = 1$, so that
the MAP formulation is not appropriate. Also, the MLE amounts to assuming that
$P(X = 1) = 1/2$, which is not suitable here. In the hypothesis testing formulation,
the goal is to detect a fire with the largest possible probability, subject to a bound
on the probability of false alarm. That is, one wishes to make the fire detector as
sensitive as possible, but not so sensitive that it produces frequent false alarms.

One has the following useful concept. 


Definition 7.2 (Receiver Operating Characteristic (ROC)) If the solution of the
problem is $PCD = R(\beta)$, the function $R(\beta)$ is called the Receiver Operating
Characteristic (ROC).

□

A typical ROC is shown in Fig. 7.13. The terminology comes from the fact that
this function depends on the conditional distributions of $Y$ given $X = 0$ and given
$X = 1$, i.e., of the signal that is received about $X$.

Note the following features of that curve. First, $R(1) = 1$ because if one is
allowed to have $PFA = 1$, then one can choose $\hat{X} = 1$ for all observations; in that
case $PCD = 1$.

Second, the function $R(\beta)$ is concave. To see this, let $0 \le \beta_1 < \beta_2 \le 1$ and
assume that $g_i(Y)$ achieves $P[g_i(Y) = 1 \mid X = 1] = R(\beta_i)$ and $P[g_i(Y) = 1 \mid X = 0] = \beta_i$
for $i = 1, 2$. Choose $\epsilon \in (0, 1)$ and define $X' = g_1(Y)$ with probability $\epsilon$
and $X' = g_2(Y)$ otherwise. Then,

$$P[X' = 1 \mid X = 0] = \epsilon P[g_1(Y) = 1 \mid X = 0] + (1 - \epsilon)P[g_2(Y) = 1 \mid X = 0]
= \epsilon\beta_1 + (1 - \epsilon)\beta_2.$$

Also,

$$P[X' = 1 \mid X = 1] = \epsilon P[g_1(Y) = 1 \mid X = 1] + (1 - \epsilon)P[g_2(Y) = 1 \mid X = 1]
= \epsilon R(\beta_1) + (1 - \epsilon)R(\beta_2).$$


Now, the decision rule $\hat{X}$ that maximizes $P[\hat{X} = 1 \mid X = 1]$ subject to
$P[\hat{X} = 1 \mid X = 0] = \epsilon\beta_1 + (1 - \epsilon)\beta_2$ must be at least as good as $X'$. Hence,

$$R(\epsilon\beta_1 + (1 - \epsilon)\beta_2) \ge \epsilon R(\beta_1) + (1 - \epsilon)R(\beta_2).$$


This inequality proves the concavity of $R(\cdot)$.

Third, the function $R(\beta)$ is nondecreasing. Intuitively, if one can make a larger
$PFA$, one can decide $\hat{X} = 1$ with a larger probability, which increases $PCD$. To
show this formally, let $\beta_2 = 1$ in the previous derivation.

Fourth, note that it may not be the case that $R(0) = 0$. For instance, assume that
$Y = X$. In this case, one chooses $\hat{X} = Y = X$, so that $PCD = 1$ and $PFA = 0$.


² If $H_0$ means that you are healthy and $H_1$ means that you have a disease, $PFA$ is the probability
of a false positive test and $1 - PCD$ is the probability of a false negative test. These are also called
type I and type II errors in the literature. $PFA$ is also called the p-value of the test.


Fig. 7.13 The Receiver Operating Characteristic is the maximum probability of
correct detection $R(\beta)$ as a function of the bound $\beta$ on the probability of false alarm:
$R(\beta) = \max\{PCD \mid PFA \le \beta\}$


7.6.2 Solution 


The solution of the hypothesis testing problem is stated in the following theorem. 


Theorem 7.4 (Neyman-Pearson (1933)) The decision $\hat{X}$ that maximizes $PCD$
subject to $PFA \le \beta$ is given by

$$\hat{X} = \begin{cases} 1, & \text{if } L(Y) > \lambda \\ 1 \text{ w.p. } \gamma, & \text{if } L(Y) = \lambda \\ 0, & \text{if } L(Y) < \lambda. \end{cases} \qquad (7.3)$$

In these expressions,

$$L(y) = \frac{f_{Y|X}[y \mid 1]}{f_{Y|X}[y \mid 0]}$$

is the likelihood ratio, i.e., the ratio of the likelihood of $y$ when $X = 1$ divided by its
likelihood when $X = 0$. Also, $\lambda > 0$ and $\gamma \in [0, 1]$ are chosen so that the resulting
$\hat{X}$ satisfies

$$P[\hat{X} = 1 \mid X = 0] = \beta.$$

□

Thus, if $L(Y)$ is large, $\hat{X} = 1$. The fact that $L(Y)$ is large means that the observed
value $Y$ is much more likely when $X = 1$ than when $X = 0$. One is then inclined
to decide that $X = 1$, i.e., to guess $\hat{X} = 1$. The situation is similar when $L(Y)$ is
small. By adjusting $\lambda$, one controls the sensitivity of the detector. If $\lambda$ is small, one
tends to choose $\hat{X} = 1$ more frequently, which increases $PCD$ but also $PFA$. One
then chooses $\lambda$ so that the detector is just sensitive enough so that $PFA = \beta$. In
some problems, one may have to hedge the guess for the critical value $\lambda$ as we will
explain in examples (Fig. 7.14).


Fig. 7.14 Jerzy Neyman, 1894-1981

We prove this theorem in the next chapter. Let us consider a number of examples. 


7.6.3 Examples 


Gaussian Channel 
Recall our model of the scalar Gaussian channel:

$$Y = X + Z,$$


where $Z =_D \mathcal{N}(0, \sigma^2)$ and is independent of $X$. In this model, $X \in \{0, 1\}$ and the
receiver tries to guess $X$ from the received signal $Y$.
We looked at two formulations: MLE and MAP. In the MLE, we want to find the
value of $X$ that makes $Y$ most likely. That is,

$$MLE[X \mid Y = y] = \arg\max_x f_{Y|X}[y \mid x].$$

The answer is $MLE[X \mid Y] = 0$ if $Y < 0.5$ and $MLE[X \mid Y] = 1$, otherwise.
The MAP is the most likely value of $X$ in $\{0, 1\}$ given $Y$. That is,

$$MAP[X \mid Y = y] = \arg\max_x P[X = x \mid Y = y].$$

To calculate the MAP, one needs to know the prior probability $p_0$ that $X = 0$. We
found out that $MAP[X \mid Y = y] = 1$ if $y \ge 0.5 + \sigma^2 \log(p_0/p_1)$ and
$MAP[X \mid Y = y] = 0$ otherwise.
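In the hypothesis testing formulation, we choose a bound $\beta$ on
$PFA = P[\hat{X} = 1 \mid X = 0]$. According to Theorem 7.4, we should calculate the
likelihood ratio $L(Y)$. We find that

$$L(y) = \frac{\exp\{-(y - 1)^2/(2\sigma^2)\}}{\exp\{-y^2/(2\sigma^2)\}} = \exp\left\{\frac{2y - 1}{2\sigma^2}\right\}.$$

Note that, for any given $\lambda$, $P(L(Y) = \lambda) = 0$. Moreover, $L(y)$ is strictly increasing
in $y$. Hence, (7.3) simplifies to

$$\hat{X} = \begin{cases} 1, & \text{if } y \ge y_0 \\ 0, & \text{otherwise.} \end{cases}$$

We choose $y_0$ so that $PFA = \beta$, i.e., so that

$$P[\hat{X} = 1 \mid X = 0] = P[Y \ge y_0 \mid X = 0] = \beta.$$

Now, given $X = 0$, $Y =_D \mathcal{N}(0, \sigma^2)$. Hence, $y_0$ is such that

$$P(\mathcal{N}(0, \sigma^2) \ge y_0) = \beta,$$

i.e., such that

$$P\left(\mathcal{N}(0, 1) \ge \frac{y_0}{\sigma}\right) = \beta.$$

For instance, Fig. 3.7 shows that if $\beta = 5\%$, then $y_0/\sigma = 1.65$. Figure 7.15
illustrates the solution.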
Fig. 7.15 The solution of the hypothesis testing problem for a Gaussian channel
Let us calculate the ROC for the Gaussian channel. Let $y(\beta)$ be such that
$P(\mathcal{N}(0, 1) \ge y(\beta)) = \beta$, so that $y_0 = y(\beta)\sigma$. The probability of correct detection
is then

$$PCD = P[\hat{X} = 1 \mid X = 1] = P[Y \ge y_0 \mid X = 1] = P(\mathcal{N}(1, \sigma^2) \ge y_0)$$

$$= P(\mathcal{N}(0, \sigma^2) \ge y_0 - 1) = P(\mathcal{N}(0, 1) \ge \sigma^{-1}y_0 - \sigma^{-1})$$

$$= P(\mathcal{N}(0, 1) \ge y(\beta) - \sigma^{-1}).$$


Figure 7.16 shows the ROC for different values of $\sigma$, obtained using Python. Not
surprisingly, the performance of the system degrades when the channel is noisier.
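
A curve of this kind can be computed from the formula $PCD = P(\mathcal{N}(0, 1) \ge y(\beta) - \sigma^{-1})$. The following sketch is one possible way to do it with the standard library; it is not the code used for Fig. 7.16, and the function name `roc` and the values of $\sigma$ and $\beta$ are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # N(0, 1)


def roc(beta, sigma):
    """R(beta) for the scalar Gaussian channel Y = X + Z, with 0 < beta < 1."""
    y_beta = STD_NORMAL.inv_cdf(1 - beta)            # P(N(0, 1) >= y_beta) = beta
    return 1 - STD_NORMAL.cdf(y_beta - 1 / sigma)    # P(N(0, 1) >= y_beta - 1/sigma)


for sigma in (0.5, 1.0, 2.0):
    print(sigma, [round(roc(beta, sigma), 3) for beta in (0.01, 0.05, 0.2, 0.5)])
```

For $\beta = 5\%$, the quantile $y(\beta)$ is about 1.65, as in the text, and the probability of correct detection decreases as $\sigma$ increases.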


Mean of Exponential RVs 

In this second example, we are testing the mean of exponential random variables.
The story is that a machine produces lightbulbs that have an exponentially distributed
lifespan with mean $1/\lambda_x$ when $X = x \in \{0, 1\}$. Assume that $\lambda_0 < \lambda_1$. The
interpretation is that the machine is defective when $X = 1$ and produces lightbulbs
that have a shorter lifespan.
Let $Y = (Y_1, \ldots, Y_n)$ be the observed lifespans of $n$ bulbs. We want to detect
that $X = 1$ with $PFA \le \beta = 5\%$.
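We find

$$L(y) = \frac{f_{Y|X}[y|1]}{f_{Y|X}[y|0]} = \frac{\prod_{i=1}^n \lambda_1 \exp\{-\lambda_1 y_i\}}{\prod_{i=1}^n \lambda_0 \exp\{-\lambda_0 y_i\}} = \left(\frac{\lambda_1}{\lambda_0}\right)^n \exp\left\{-(\lambda_1 - \lambda_0)\sum_{i=1}^n y_i\right\}.$$

Since $\lambda_1 > \lambda_0$, we find that $L(y)$ is strictly decreasing in $\sum_i y_i$ and also that
$P(L(Y) = \lambda) = 0$ for all $\lambda$. Thus, (7.3) simplifies to

$$\hat{X} = \begin{cases} 1, & \text{if } \sum_{i=1}^n Y_i \le a \\ 0, & \text{otherwise,} \end{cases}$$

where $a$ is chosen so that

$$P\left[\sum_{i=1}^n Y_i \le a \,\Big|\, X = 0\right] = \beta = 5\%.$$

Now, when $X = 0$, the $Y_i$ are i.i.d. random variables that are exponentially
distributed with mean $1/\lambda_0$. The distribution of their sum is rather complicated.
We approximate it using the Central Limit Theorem.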


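We have, approximately,

$$\frac{Y_1 + \cdots + Y_n - n\lambda_0^{-1}}{\sqrt{n}} \approx \mathcal{N}(0, \lambda_0^{-2}),$$

since $\mathrm{var}(Y_i) = \lambda_0^{-2}$ when $X = 0$. Now,

$$\sum_{i=1}^n Y_i \le a \iff \frac{Y_1 + \cdots + Y_n - n\lambda_0^{-1}}{\sqrt{n}} \le \frac{a - n\lambda_0^{-1}}{\sqrt{n}}.$$

Hence,

$$P\left[\sum_{i=1}^n Y_i \le a \,\Big|\, X = 0\right] \approx P\left(\mathcal{N}(0, \lambda_0^{-2}) \le \frac{a - n\lambda_0^{-1}}{\sqrt{n}}\right) = P\left(\mathcal{N}(0, 1) \le \lambda_0 \frac{a - n\lambda_0^{-1}}{\sqrt{n}}\right).$$

Hence, if we want this probability to be equal to 5%, by (3.2) and the symmetry of the
Gaussian distribution, i.e., $P(\mathcal{N}(0, 1) \le -1.65) = 5\%$, we must choose $a$ so
that $\lambda_0 (a - n\lambda_0^{-1})/\sqrt{n} = -1.65$, i.e.,

$$a = (n - 1.65\sqrt{n})\lambda_0^{-1}.$$

The threshold lies below the mean $n/\lambda_0$ of the sum under $X = 0$, as it should, since we
declare the machine defective when the total lifespan is unusually small.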


One point is worth noting for this example. We see that the calculation of $\hat{X}$ is
based on $Y_1 + \cdots + Y_n$. Thus, although one has measured the individual lifespans of
the $n$ bulbs, the decision is based only on their sum, or equivalently on their average.
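As a sanity check on the Central Limit Theorem approximation, one can compare it with the exact distribution of the sum, which is Erlang. The sketch below is only an illustration: the values $n = 100$, $\lambda_0 = 1$, $\lambda_1 = 1.5$ and the function names are assumptions, not part of the text. The threshold is the approximate 5% quantile of the sum under $X = 0$, namely $(n - 1.65\sqrt{n})/\lambda_0$, because a false alarm is the event that the sum falls below $a$.

```python
import math


def erlang_cdf(a: float, n: int, rate: float) -> float:
    """P(Y_1 + ... + Y_n <= a) for i.i.d. exponential Y_i with the given rate."""
    x = rate * a
    # The sum is <= a iff a Poisson(x) variable is >= n; sum its pmf in log space.
    tail = sum(math.exp(-x + k * math.log(x) - math.lgamma(k + 1))
               for k in range(n))
    return 1.0 - tail


def clt_threshold(n: int, rate0: float, z: float = 1.65) -> float:
    """CLT approximation of the 5% quantile of the sum when X = 0."""
    return (n - z * math.sqrt(n)) / rate0


n, rate0, rate1 = 100, 1.0, 1.5   # illustrative values, not from the text
a = clt_threshold(n, rate0)
pfa = erlang_cdf(a, n, rate0)     # exact P[sum <= a | X = 0]
pcd = erlang_cdf(a, n, rate1)     # exact P[sum <= a | X = 1]
print(f"a = {a:.1f}, exact PFA = {pfa:.3f}, exact PCD = {pcd:.3f}")
```

The exact false alarm probability comes out close to, but not exactly, 5%, because the Erlang distribution is slightly skewed for finite $n$.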


Bias of a Coin 
In this example, we observe $n$ coin flips. Given $X = x \in \{0, 1\}$, the coin flips are
i.i.d. $B(p_x)$. That is, given $X = x$, the outcomes $Y_1, \ldots, Y_n$ of the coin flips are
i.i.d. and equal to 1 with probability $p_x$ and to zero otherwise. We assume that
$p_1 > p_0 = 0.5$. That is, we want to test whether the coin is fair or biased.

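Here, the random variables $Y_i$ are discrete. We see that

$$P[Y_i = y_i, i = 1, \ldots, n|X = x] = \prod_{i=1}^n p_x^{y_i}(1 - p_x)^{1 - y_i} = p_x^S (1 - p_x)^{n - S} \text{ where } S = y_1 + \cdots + y_n.$$

Hence,

$$L(Y_1, \ldots, Y_n) = \frac{P[Y_i = y_i, i = 1, \ldots, n|X = 1]}{P[Y_i = y_i, i = 1, \ldots, n|X = 0]} = \left(\frac{p_1}{p_0}\right)^S \left(\frac{1 - p_1}{1 - p_0}\right)^{n - S} = \left(\frac{1 - p_1}{1 - p_0}\right)^n \left(\frac{p_1(1 - p_0)}{p_0(1 - p_1)}\right)^S.$$

Since $p_1 > p_0$, we see that the likelihood ratio is increasing in $S$. Thus, the solution
of the hypothesis testing problem is

$$\hat{X} = 1\{S \ge n_0\},$$

where $n_0$ is such that $P[S \ge n_0|X = 0] \approx \beta$. To calculate $n_0$, we approximate $S$,
when $X = 0$, by using the Central Limit Theorem. We have

$$P[S \ge n_0|X = 0] = P\left[\frac{S - np_0}{\sqrt{n}} \ge \frac{n_0 - np_0}{\sqrt{n}} \,\Big|\, X = 0\right] \approx P\left(\mathcal{N}(0, p_0(1 - p_0)) \ge \frac{n_0 - np_0}{\sqrt{n}}\right)$$
$$= P\left(\mathcal{N}(0, 0.25) \ge \frac{n_0 - 0.5n}{\sqrt{n}}\right) = P\left(\mathcal{N}(0, 1) \ge \frac{2n_0 - n}{\sqrt{n}}\right).$$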




Say that $\beta = 5\%$, then we need

$$\frac{2n_0 - n}{\sqrt{n}} = 1.65,$$

by (3.2). Hence,

$$n_0 = 0.5n + 0.83\sqrt{n}.$$
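The quality of this approximation can be checked against the exact binomial tail. The sketch below is an illustration under assumed values of $n$; the function names are hypothetical. Since $S$ is an integer, the event $S \ge n_0$ is the same as $S \ge \lceil n_0 \rceil$.

```python
import math


def clt_coin_threshold(n: int, z: float = 1.65) -> float:
    """n0 = 0.5 n + 0.5 z sqrt(n), obtained from (2 n0 - n) / sqrt(n) = z."""
    return 0.5 * n + 0.5 * z * math.sqrt(n)


def fair_coin_tail(n: int, n0: float) -> float:
    """Exact P[S >= n0 | X = 0] when S is Binomial(n, 0.5)."""
    k0 = math.ceil(n0)
    return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n


for n in (100, 1000):
    n0 = clt_coin_threshold(n)
    print(f"n = {n}: n0 = {n0:.1f}, exact PFA = {fair_coin_tail(n, n0):.3f}")
```

The exact probability of false alarm is slightly below 5% in both cases, which is consistent with the rounding of $n_0$ up to the next integer.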


Discrete Observations 

In the examples that we considered so far, the random variable L(Y) is continuous. 
In such cases, the probability that L(Y) = A is always zero, and there is no need to 
randomize the choice of X for specific values of Y. In our next examples, that need 
arises. 

First consider, as usual, the problem of choosing $\hat{X} \in \{0, 1\}$ to maximize the
probability of correct detection $P[\hat{X} = 1|X = 1]$ subject to a bound
$P[\hat{X} = 1|X = 0] \le \beta$ on the probability of false alarm. However, assume that we make
no observation. In this case, the solution is to choose $\hat{X} = 1$ with probability
$\beta$. This choice meets the bound on the probability of false alarm and achieves a
probability of correct detection equal to $\beta$. This randomized choice is better than
always deciding $\hat{X} = 0$.

Now consider a more complex example where $Y \in \{A, B, C\}$ and

$$P[Y = A|X = 1] = 0.2, \quad P[Y = B|X = 1] = 0.2, \quad P[Y = C|X = 1] = 0.6$$
$$P[Y = A|X = 0] = 0.2, \quad P[Y = B|X = 0] = 0.5, \quad P[Y = C|X = 0] = 0.3.$$

Accordingly, the values of the likelihood ratio $L(y) = P[Y = y|X = 1]/P[Y = y|X = 0]$
are as follows:

$$L(A) = 1, \quad L(B) = 0.4 \quad \text{and} \quad L(C) = 2.$$


We rank the observations in increasing order of the values of $L$, as shown in
Fig. 7.17.

Fig. 7.17 The three possible observations

Y             B      A      C
P[Y|X = 1]    0.2    0.2    0.6
P[Y|X = 0]    0.5    0.2    0.3
L(Y)          0.4    1      2

$\lambda = 2.1 \Rightarrow PCD = 0, PFA = 0$
$\lambda = 2 \Rightarrow PCD = 0.6\gamma, PFA = 0.3\gamma$
$\lambda = 1.1 \Rightarrow PCD = 0.6, PFA = 0.3$
$\lambda = 1 \Rightarrow PCD = 0.6 + 0.2\gamma, PFA = 0.3 + 0.2\gamma$
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Fig. 7.18 The ROC for the 
discrete observation example 


The solution of the hypothesis testing problem amounts to choosing a threshold
$\lambda$ and a randomization $\gamma$ so that

$$P[\hat{X} = 1|Y] = 1\{L(Y) > \lambda\} + \gamma 1\{L(Y) = \lambda\}.$$

Also, we choose $\lambda$ and $\gamma$ so that $P[\hat{X} = 1|X = 0] = \beta$.

Figure 7.17 shows that if we choose $\lambda = 2.1$, then $L(Y) < \lambda$, for all values of $Y$,
so that we always decide $\hat{X} = 0$. Accordingly, $PCD = 0$ and $PFA = 0$.

The figure also shows that if we choose $\lambda = 2$ and a parameter $\gamma$, then we decide
$\hat{X} = 1$ when $L(Y) = 2$ with probability $\gamma$. Thus, if $X = 0$, we decide $\hat{X} = 1$
with probability $0.3\gamma$, because $Y = C$ with probability 0.3 when $X = 0$ and this is
precisely when $L(Y) = 2$ and we randomize with probability $\gamma$. The figure shows
other examples.

It should be clear that as we reduce $\lambda$ from 2.1 to 0.39, the probability that we
decide $\hat{X} = 1$ when $X = 0$ increases from 0 to 1. Also, by choosing the parameter
$\gamma$ suitably when $\lambda$ is set to a possible value of $L(Y)$, we can adjust $PFA$ to any value
in $[0, 1]$.

For instance, we can have $PFA = 0.05$ if we choose $\lambda = 2$ and $\gamma = 0.05/0.3$.
Similarly, we can have $PFA = 0.4$ by choosing $\lambda = 1$ and $\gamma = 0.5$. Indeed,
in this case, we decide $\hat{X} = 1$ when $Y = C$ and also with probability 0.5 when
$Y = A$, so that this occurs with probability $0.3 + 0.2 \times 0.5 = 0.4$ when $X = 0$. The
corresponding $PCD$ is then $0.6 + 0.2 \times 0.5 = 0.7$.

Figure 7.18 shows $PCD$ as a function of the bound on $PFA$.
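The choice of $\lambda$ and $\gamma$ can be automated by scanning the possible values of the likelihood ratio from the largest down. The following is a minimal sketch for finite observation sets; the function name `randomized_lr_test` and the dictionary representation of the two distributions are assumptions made for illustration, and the code assumes $P[Y = y|X = 0] > 0$ for every $y$.

```python
def randomized_lr_test(p1, p0, beta):
    """Return (lam, gamma, pcd) for the test 1{L > lam} + gamma 1{L = lam}.

    p1[y] and p0[y] are P[Y = y | X = 1] and P[Y = y | X = 0]; p0[y] > 0 is assumed.
    """
    ratio = {y: p1[y] / p0[y] for y in p0}
    pfa_above = pcd_above = 0.0          # mass of {L(Y) > lam} under X = 0 and X = 1
    for lam in sorted(set(ratio.values()), reverse=True):
        tie = [y for y in ratio if ratio[y] == lam]
        pfa_tie = sum(p0[y] for y in tie)
        pcd_tie = sum(p1[y] for y in tie)
        if pfa_above + pfa_tie >= beta:  # lam is the threshold; randomize on ties
            gamma = (beta - pfa_above) / pfa_tie
            return lam, gamma, pcd_above + gamma * pcd_tie
        pfa_above += pfa_tie
        pcd_above += pcd_tie
    return 0.0, 1.0, pcd_above           # bound not binding: always decide 1


p1 = {"A": 0.2, "B": 0.2, "C": 0.6}
p0 = {"A": 0.2, "B": 0.5, "C": 0.3}
for beta in (0.05, 0.3, 0.4):
    lam, gamma, pcd = randomized_lr_test(p1, p0, beta)
    print(f"beta = {beta}: lam = {lam}, gamma = {gamma:.3f}, PCD = {pcd:.3f}")
```

For $\beta = 0.4$ this recovers $\lambda = 1$, $\gamma = 0.5$ and $PCD = 0.7$, as computed in the text. Evaluating the function over a grid of $\beta$ gives the piecewise linear ROC.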


7.7 Summary 


* MAP and MLE; 

* BPSK; 

* Huffman Codes; 

* Independent Gaussian Errors; 

* Hypothesis Testing: Neyman-Pearson Theorem. 
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7.7.1 Key Equations and Formulas 


Bayes' Rule               $\pi_i = p_i q_i / (\sum_j p_j q_j)$                                              Theorem 7.1
MAP[X|Y = y]              $\arg\max_x P[X = x|Y = y]$                                                       Definition 7.1
MLE[X|Y = y]              $\arg\max_x P[Y = y|X = x]$                                                       Definition 7.1
Likelihood Ratio          $L(y) = f_{Y|X}[y|1] / f_{Y|X}[y|0]$                                              Theorem 7.4
Gaussian Channel          $\mathrm{MAP}[X|Y = y] = 1\{y \ge \frac{1}{2} + \sigma^2 \log(p_0/p_1)\}$         (7.2)
Neyman-Pearson Theorem    $P[\hat{X} = 1|Y] = 1\{L(Y) > \lambda\} + \gamma 1\{L(Y) = \lambda\}$             Theorem 7.4
ROC                       $\mathrm{ROC}(\beta) = \max PCD$ s.t. $PFA \le \beta$                             Definition 7.2


7.8 References 


Detection theory is obviously a classical topic. It is at the core of digital commu- 
nication (see e.g., Proakis (2000)). The Neyman-Pearson Theorem is introduced in 
Neyman and Pearson (1933). For a discussion of hypothesis testing, see Lehmann 
(2010). For more details on digital communication and, in particular, on wireless 
communication, see the excellent presentation in Tse and Viswanath (2005). 


7.9 Problems 


Problem 7.1 Assume that when $X = 0$, $Y = \mathcal{N}(0, 1)$ and when $X = 1$, $Y = \mathcal{N}(0, \sigma^2)$
with $\sigma^2 > 1$. Calculate $\mathrm{MLE}[X|Y]$.


Problem 7.2 Let $X, Y$ be i.i.d. $U[0, 1]$ random variables. Define $V = X + Y$ and
$W = X - Y$.


(a) Show that V and W are uncorrelated; 
(b) Are V and W independent? Prove or disprove. 


Problem 7.3 A digital link uses the QAM-16 constellation shown in Fig. 7.12 with 
x; = (1, -1). The received signal is $Y = X + Z$ where $Z =_D \mathcal{N}(0, \sigma^2 I)$. The
receiver uses the MAP. Simulate the system using Python to estimate the fraction of
errors for $\sigma = 0.2, 0.3$.


Problem 7.4 Use Python to verify the CLT with i.i.d. $U[0, 1]$ random variables $X_n$.
That is, generate the random variables $\{X_1, \ldots, X_N\}$ for $N = 10000$. Calculate

$$Y_n = \frac{X_{100n+1} + \cdots + X_{(n+1)100} - 50}{10}, \quad n = 0, 1, \ldots, 99.$$

Plot the empirical cdf of $\{Y_0, \ldots, Y_{99}\}$ and compare with the cdf of a $\mathcal{N}(0, 1/12)$
random variable.


Problem 7.5 You are testing a digital link that corresponds to a BSC with some
error probability $\epsilon \in [0, 0.5)$.


(a) Assume you observe the input and the output of the link. How do you find the
MLE of $\epsilon$?

(b) You are told that the inputs are i.i.d. bits that are equal to 1 with probability 0.6
and to 0 with probability 0.4. You observe $n$ outputs. How do you calculate the
MLE of $\epsilon$?

(c) The situation is as in the previous case, but you are told that $\epsilon$ has pdf $4 - 8x$ on
$[0, 0.5)$. How do you calculate the MAP of $\epsilon$ given $n$ outputs?


Problem 7.6 The situation is the same as in the previous problem. You observe $n$
inputs and outputs of the BSC. You want to solve a hypothesis problem to detect
that $\epsilon > 0.1$ with a probability of false alarm at most equal to 5%. Assume that $n$ is
very large and use the CLT.


Problem 7.7 The random variable $X$ is such that $P(X = 1) = 2/3$ and $P(X = 0) = 1/3$.
When $X = 1$, the random variable $Y$ is exponentially distributed with
rate 1. When $X = 0$, the random variable $Y$ is uniformly distributed in $[0, 2]$. (Hint:
Be careful about the case $Y > 2$.)


(a) Find $\mathrm{MLE}[X|Y]$;

(b) Find $\mathrm{MAP}[X|Y]$;

(c) Solve the following hypothesis testing problem:
Maximize $P[\hat{X} = 1|X = 1]$
subject to $P[\hat{X} = 1|X = 0] \le 5\%$.


Problem 7.8 Simulate the following communication channel. There is an i.i.d.
source that generates symbols $\{1, 2, 3, 4\}$ according to a prior distribution
$\pi = [p_1, p_2, p_3, p_4]$. The symbols are modulated by a QPSK scheme, i.e., they are mapped
to constellation points $(\pm 1, \pm 1)$. The communication is on a baseband Gaussian
channel, i.e., if the sent signal is $(x_1, x_2)$, the received signal is

$$y_1 = x_1 + Z_1,$$
$$y_2 = x_2 + Z_2,$$

where $Z_1$ and $Z_2$ are independent $\mathcal{N}(0, \sigma^2)$ random variables. Find the MAP
detector and ML detector analytically.

Simulate the channel using Python for $\pi = [0.1, 0.2, 0.3, 0.4]$, and $\sigma = 0.1$ and
$\sigma = 0.5$. Evaluate the probability of correct detection.


Problem 7.9 Let $X$ be equally likely to take any of the values $\{1, 2, 3\}$. Given $X$,
the random variable $Y$ is $\mathcal{N}(X, 1)$.


(a) Find $\mathrm{MAP}[X|Y]$;
(b) Calculate $\mathrm{MLE}[X|Y]$;
(c) Calculate $E((X - Y)^2)$.


Problem 7.10 The random variable $X$ is such that $P(X = 0) = P(X = 1) = 0.5$.
Given $X$, the random variables $Y_n$ are i.i.d. $U[0, 1.1 - 0.1X]$. The goal is to guess $X$
from the observations $Y_n$. Each observation has a cost $\beta > 0$. To get nice numerical
solutions, we assume that

$$\beta = 0.018 \approx 0.5(1.1)^{-10} \log(1.1).$$

(a) Assume that you have observed $Y^n = (Y_1, \ldots, Y_n)$. What is the guess $\hat{X}_n$ based
on these observations that maximizes the probability that $\hat{X}_n = X$?

(b) What is the corresponding value of $P(\hat{X}_n = X)$?

(c) Choose $n$ to maximize $P(\hat{X}_n = X) - \beta n$ where $\hat{X}_n$ is chosen on the basis of
$(Y_1, \ldots, Y_n)$. Hint: You will recall that

$$\frac{d}{dx}(a^x) = a^x \log(a).$$


Problem 7.11 The random variable X is exponentially distributed with mean 1. 
Given X, the random variable Y is exponentially distributed with rate X. 


(a) Find MLE[X|Y]; 
(b) Find MAP[X|Y]; 
(c) Solve the following hypothesis testing problem: 


Maximize $P[\hat{X} = 1|X = a]$
subject to $P[\hat{X} = 1|X = 1] \le 5\%$,
where $a > 1$ is given.


Problem 7.12 Consider a random variable $Y$ that is exponentially distributed with
parameter $\theta$. You observe $n$ i.i.d. samples $Y_1, \ldots, Y_n$ of this random variable.
Calculate $\hat{\theta} = \mathrm{MLE}[\theta|Y_1, \ldots, Y_n]$. What is the bias of this estimator, i.e.,
$E[\hat{\theta} - \theta|\theta]$? Does the bias converge to 0 as $n$ goes to infinity?


Problem 7.13 Assume that $Y =_D U[a, b]$. You observe $n$ i.i.d. samples $Y_1, \ldots, Y_n$
of this random variable. Calculate the maximum likelihood estimators $\hat{a}$ of $a$ and $\hat{b}$
of $b$. What is the bias of $\hat{a}$ and $\hat{b}$?


Problem 7.14 We are looking at a hypothesis testing problem where $X, \hat{X}$ take
values in $\{0, 1\}$. The value of $\hat{X}$ is decided based on the observed value of the random
vector $Y$. We assume that $Y$ has a density $f_i(y)$ given that $X = i$, for $i = 0, 1$, and
we define $L(y) := f_1(y)/f_0(y)$.

Define $g(\beta)$ to be the maximum value of $P[\hat{X} = 1|X = 1]$ subject to
$P[\hat{X} = 1|X = 0] \le \beta$ for $\beta \in [0, 1]$. Then (choose the correct answers, if any)

* $g(\beta) \ge 1 - \beta$;
* $g(\beta) = \beta$;
* The optimal decision is described by a function $h(y) = P[\hat{X} = 1|Y = y]$ and
this function is nondecreasing in $f_1(y)/f_0(y)$.


Problem 7.15 Given $\theta \in \{0, 1\}$, $X = \theta(1, 1)' + V$ where $V_1$ and $V_2$ are independent
and uniformly distributed in $[-2, 2]$. Solve the hypothesis testing problem:

Maximize $P[\hat{\theta} = 1|\theta = 1]$
s.t. $P[\hat{\theta} = 1|\theta = 0] \le$ ...


Problem 7.16 Given $\theta = 1$, $X =_D \mathrm{Exp}(1)$ and, given $\theta = 0$, $X =_D U[0, 2]$.


(a) Find $\hat{\theta} = HT[\theta|X, \beta]$, defined as the random variable $\hat{\theta}$ determined from $X$
that maximizes $P[\hat{\theta} = 1|\theta = 1]$ subject to $P[\hat{\theta} = 1|\theta = 0] \le \beta$;

(b) Compute the resulting value of $\alpha(\beta) = P[\hat{\theta} = 1|\theta = 1]$;

(c) Sketch the ROC curve $\alpha(\beta)$ for $\beta \in [0, 1]$.


Problem 7.17 You observe a random sequence $\{X_n, n = 0, 1, 2, \ldots\}$. With
probability $p$, $\theta = 0$ and this sequence is i.i.d. Bernoulli with
$P(X_n = 0) = P(X_n = 1) = 0.5$. With probability $1 - p$, $\theta = 1$ and the sequence is a stationary
Markov chain on $\{0, 1\}$ with transition probabilities $P(0, 1) = P(1, 0) = \alpha$. The
parameter $\alpha$ is given in $(0, 1)$.


(1) Find $\mathrm{MAP}[\theta|X_0, \ldots, X_n]$;

(2) Discuss the convergence of $\hat{\theta}_n$;

(3) Discuss the composite hypothesis testing problem where $\alpha < 0.5$ when $\theta = 1$
and $\alpha = 0.5$ when $\theta = 0$.


Problem 7.18 If $\theta = 0$, the sequence $\{X_n, n \ge 0\}$ is a Markov chain on a finite set
$\mathcal{X}$ with transition matrix $P_0$. If $\theta = 1$, the transition matrix is $P_1$. In both cases,
$X_0 = x_0$ is known. Find $\mathrm{MLE}[\theta|X_0, \ldots, X_n]$.
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Topics: Optimality of Huffman Codes, LDPC Codes, Proof of Neyman- 
Pearson Theorem, Jointly Gaussian RVs, Statistical Tests, ANOVA 


8.1 Proof of Optimality of the Huffman Code 
We stated the following result in Chap. 7. Here, we provide a proof. 


Theorem 8.1 (Optimality of Huffman Code) The Huffman code has the smallest 
average number of bits per symbol among all prefix-free codes (Fig. 8.1). 


Fig. 8.1 Huffman code

Symbol        A       B      C      D
Frequency     0.55    0.3    0.1    0.05
Codeword      0       10     110    111


Proof The argument in Huffman (1952) is by induction on the number of symbols.
Assume that the Huffman code has an average path length $L(n)$ that is minimum
for $n$ symbols and that there is some other tree $T$ with a smaller average path length
$A(n + 1)$ than the Huffman code for $n + 1$ symbols. Let $X$ and $Y$ be the two least
frequent symbols and $x \ge y$ their frequencies. We can pick these symbols in $T$ so
that their path lengths are maximum and such that $Y$ has the largest path length in
$T$. Otherwise, we could swap $Y$ in $T$ with a more frequent symbol and reduce the
average path length. Accept for now the claim that we can also pick $X$ and $Y$ so
that they are siblings in $T$. By merging $X$ and $Y$ into their parent $Z$ with frequency
$z = x + y$, we have constructed a code for $n$ symbols with average path length
$A(n + 1) - z$. Hence, $L(n) \le A(n + 1) - z$. Now, the Huffman code for $n + 1$ symbols
would merge $X$ and $Y$ also, so that its average path length is $L(n + 1) := L(n) + z$.
Thus, $L(n + 1) \le A(n + 1)$, which contradicts the assumption that the Huffman
code is not optimal for $n + 1$ symbols. It remains to prove the claim about $X$ and
$Y$ being siblings. First note that $Y$ having the maximum path length, it cannot be
an only child, for otherwise, we would replace its parent by $Y$ and reduce the path
length. Say that $Y$ has a sibling $V$ other than $X$. By swapping $V$ and $X$, one does
not increase the average path length, since the frequency of $V$ is not smaller than
that of $X$. This concludes the proof. $\square$
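The merging step used in the proof is also how the code is built in practice. The sketch below is one common way to implement the construction with a priority queue; it is an illustration, not the author's code, and the function name `huffman_code` and the convention of labeling the less frequent branch with 1 are assumptions. It is applied to the frequencies of Fig. 8.1.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Return {symbol: codeword} by repeatedly merging the two least frequent nodes."""
    tiebreak = count()  # avoids comparing dicts when two frequencies are equal
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, code0 = heapq.heappop(heap)   # least frequent node
        f1, _, code1 = heapq.heappop(heap)   # second least frequent node
        merged = {s: "1" + w for s, w in code0.items()}
        merged.update({s: "0" + w for s, w in code1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


freqs = {"A": 0.55, "B": 0.3, "C": 0.1, "D": 0.05}
code = huffman_code(freqs)
average_length = sum(freqs[s] * len(code[s]) for s in freqs)
print(code, average_length)
```

With these frequencies the sketch reproduces the codewords of Fig. 8.1, with an average of about 1.6 bits per symbol. A source with a single symbol would get the empty codeword, a corner case this sketch does not handle.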


8.2 Proof of Neyman-Pearson Theorem 7.4 


The idea of the proof is to consider any other decision rule that produces an estimate
$\tilde{X}$ with $P[\tilde{X} = 1|X = 0] \le \beta$ and to show that

$$P[\tilde{X} = 1|X = 1] \le P[\hat{X} = 1|X = 1], \quad (8.1)$$

where $\hat{X}$ is specified by the theorem. To show this, we note that

$$(\hat{X} - \tilde{X})(L(Y) - \lambda) \ge 0.$$

Indeed, when $L(Y) - \lambda > 0$, one has $\hat{X} = 1 \ge \tilde{X}$, so that the expression above is
indeed nonnegative. Similarly, when $L(Y) - \lambda < 0$, one has $\hat{X} = 0 \le \tilde{X}$, so that
the expression is again nonnegative.

Taking the expected value of this expression given $X = 0$, we find

$$E[\hat{X} L(Y)|X = 0] - E[\tilde{X} L(Y)|X = 0] \ge \lambda (E[\hat{X}|X = 0] - E[\tilde{X}|X = 0]). \quad (8.2)$$

Now,

$$E[\hat{X}|X = 0] = P[\hat{X} = 1|X = 0] = \beta \ge P[\tilde{X} = 1|X = 0] = E[\tilde{X}|X = 0].$$

Hence, since $\lambda \ge 0$, (8.2) implies that

$$E[\hat{X} L(Y)|X = 0] \ge E[\tilde{X} L(Y)|X = 0]. \quad (8.3)$$

Observe that, for any function $g(Y)$, one has

$$E[g(Y) L(Y)|X = 0] = \int g(y) L(y) f_{Y|X}[y|0] dy = \int g(y) \frac{f_{Y|X}[y|1]}{f_{Y|X}[y|0]} f_{Y|X}[y|0] dy$$
$$= \int g(y) f_{Y|X}[y|1] dy = E[g(Y)|X = 1].$$

Note that this result continues to hold even for a function $g(Y, Z)$ where $Z$ is a
random variable that is independent of $X$ and $Y$. In particular,

$$E[\hat{X} L(Y)|X = 0] = E[\hat{X}|X = 1] = P[\hat{X} = 1|X = 1].$$

Similarly,

$$E[\tilde{X} L(Y)|X = 0] = P[\tilde{X} = 1|X = 1].$$

Combining these results with (8.3) gives (8.1).
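The change-of-measure identity used in the proof, that weighting by the likelihood ratio under $X = 0$ gives the expectation under $X = 1$, can be checked on a small discrete model where the integral becomes a sum. The sketch below reuses the three-observation example of Chapter 7; the test function `g` is an arbitrary choice made for illustration.

```python
from fractions import Fraction as F

# P[Y = y | X = 1] and P[Y = y | X = 0] for the discrete example of Chapter 7.
p1 = {"A": F(2, 10), "B": F(2, 10), "C": F(6, 10)}
p0 = {"A": F(2, 10), "B": F(5, 10), "C": F(3, 10)}
L = {y: p1[y] / p0[y] for y in p0}  # likelihood ratio


def g(y):
    """An arbitrary function of the observation (hypothetical values)."""
    return {"A": 3, "B": -1, "C": 7}[y]


lhs = sum(g(y) * L[y] * p0[y] for y in p0)   # E[g(Y) L(Y) | X = 0]
rhs = sum(g(y) * p1[y] for y in p1)          # E[g(Y) | X = 1]
print(lhs, rhs)
```

Exact rational arithmetic is used so that the two sides can be compared for equality without rounding issues.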


8.3 Jointly Gaussian Random Variables 


In many systems, the errors in the different components of the measured vector Y 
are not independent. A suitable model for this situation is that 


Y = X + AZ, 


where $Z = (Z_1, Z_2)$ is a pair of i.i.d. $\mathcal{N}(0, 1)$ random variables and $A$ is some $2 \times 2$
matrix. The key idea here is that the components of the noise vector $AZ$ will not be
independent in general. For instance, if the two rows of $A$ are identical, so are the
two components of $AZ$. Thus, this model allows us to capture a dependency between
the errors in the two components. The model also suggests that the dependency
comes from the fact that the errors are different linear combinations of the same
fundamental sources of noise.

For such a model, how does one compute $\mathrm{MLE}[X|Y]$? We explain in the next
section that
$$f_{Y|X}[y|x] = \frac{1}{2\pi|A|} \exp\left\{-\frac{1}{2}(y - x)'(AA')^{-1}(y - x)\right\}, \quad (8.4)$$

where $A'$ is the transpose of the matrix $A$, i.e., $A'(i, j) = A(j, i)$ for $i, j \in \{1, 2\}$, and
$|A|$ is the absolute value of the determinant of $A$.
Consequently, the MLE is the value $x_k$ of $x$ that minimizes

$$(y - x)'(AA')^{-1}(y - x) = \|A^{-1}y - A^{-1}x\|^2.$$

(For simplicity, we assume that $A$ is invertible.)
That is, we want to find the vector $x_k$ such that $A^{-1}x_k$ is the closest to $A^{-1}y$.
One way to understand this result is to note that

$$W := A^{-1}Y = A^{-1}X + Z =: V + Z.$$

Thus, if we calculate $A^{-1}Y$ from the measured vector $Y$, we find that its noise components
are i.i.d. $\mathcal{N}(0, 1)$ for a given value of $X$. Hence, it is easy to calculate
$\mathrm{MLE}[V|W = w]$: it is the closest value to $w$ in the set $\{A^{-1}x_1, \ldots, A^{-1}x_{16}\}$ of possible values of
$V$. It is then reasonable to expect that we can recover the MLE of $X$ by multiplying
the MLE of $V = A^{-1}X$ by $A$, i.e., that

$$\mathrm{MLE}[X|Y = y] = A \times \mathrm{MLE}[V|W = A^{-1}y].$$
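The whitening argument translates directly into a decoder: multiply the observation and the candidate points by $A^{-1}$ and pick the closest candidate. The sketch below does this for the $2 \times 2$ case in plain Python. The four-point constellation, the matrix $A$ and the observation are hypothetical values chosen so that the MLE differs from the constellation point that is closest to $y$ in the usual Euclidean sense.

```python
def inv2(A):
    """Inverse of a 2 x 2 matrix given as nested lists (assumed invertible)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def apply(M, v):
    """Matrix-vector product for a 2 x 2 matrix and a 2-vector."""
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])


def dist2(u, v):
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2


def mle_correlated_noise(y, points, A):
    """MLE of X from Y = X + A Z: nearest point after multiplying by A^{-1}."""
    A_inv = inv2(A)
    w = apply(A_inv, y)
    return min(points, key=lambda x: dist2(apply(A_inv, x), w))


points = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # hypothetical constellation
A = [[1.0, 0.0], [1.0, 0.2]]                    # noise (Z1, Z1 + 0.2 Z2): strongly correlated
y = (0.2, -0.3)
mle = mle_correlated_noise(y, points, A)
nearest = min(points, key=lambda x: dist2(x, y))
print("MLE:", mle, " Euclidean nearest:", nearest)
```

Here the error $y - (1, 1) = (-0.8, -1.3)$ lies roughly along the diagonal, which is the direction in which this noise tends to move, so $(1, 1)$ is more likely than the geometrically closer point $(1, -1)$.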


8.3.1 Density of Jointly Gaussian Random Variables 


Our goal in this section is to explain (8.4) and more general versions of this result. 
We start by stating the main definition and a result that we prove later. 


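Definition 8.1 (Jointly Gaussian $\mathcal{N}(\mu_Y, \Sigma_Y)$ Random Variables) The random
variables $Y = (Y_1, \ldots, Y_n)$ are jointly Gaussian with mean $\mu_Y$ and covariance
$\Sigma_Y$, which we write as $Y =_D \mathcal{N}(\mu_Y, \Sigma_Y)$, if

$$Y = AX + \mu_Y \text{ with } \Sigma_Y = AA',$$

where $X$ is a vector of independent $\mathcal{N}(0, 1)$ random variables.


Here is the main result.


Theorem 8.2 (Density of $\mathcal{N}(\mu_Y, \Sigma_Y)$ Random Variables) Let $Y =_D \mathcal{N}(\mu_Y, \Sigma_Y)$
with $\Sigma_Y$ invertible. Then

$$f_Y(y) = \frac{1}{\sqrt{|\Sigma_Y|}(2\pi)^{n/2}} \exp\left\{-\frac{1}{2}(y - \mu_Y)'\Sigma_Y^{-1}(y - \mu_Y)\right\}. \quad (8.5)$$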
Fig. 8.2 The level curves of $f_Y$


The level curves of this jpdf are ellipses, as sketched in Fig. 8.2.

Note that this joint distribution is determined by the mean and the covariance
matrix. In particular, if $Y' = (V', W')$ are jointly Gaussian, then the joint
distribution is characterized by the mean and $\Sigma_V$, $\Sigma_W$ and $\mathrm{cov}(V, W)$. We know
that if $V$ and $W$ are independent, then they are uncorrelated, i.e., $\mathrm{cov}(V, W) = 0$.
Since the joint distribution is characterized by the mean and covariance, we
conclude that if they are uncorrelated, they are independent. We note this fact as
a theorem.


Theorem 8.3 (Jointly Gaussian RVs Are Independent Iff Uncorrelated) Let $V$
and $W$ be jointly Gaussian random variables. Then, they are independent if and
only if they are uncorrelated.


We will use the following result.


Theorem 8.4 (Linear Combinations of JG Are JG) Let $V$ and $W$ be jointly
Gaussian. Then $AV + a$ and $BW + b$ are jointly Gaussian.


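Proof By definition, $V$ and $W$ are jointly Gaussian if they are linear functions of
i.i.d. $\mathcal{N}(0, 1)$ random variables. But then $AV + a$ and $BW + b$ are linear functions
of the same i.i.d. $\mathcal{N}(0, 1)$ random variables, so that they are jointly Gaussian. More
explicitly, there are some i.i.d. $\mathcal{N}(0, 1)$ random variables $X$ so that

$$\begin{bmatrix} V \\ W \end{bmatrix} = \begin{bmatrix} c \\ d \end{bmatrix} + \begin{bmatrix} C \\ D \end{bmatrix} X,$$

so that

$$\begin{bmatrix} AV + a \\ BW + b \end{bmatrix} = \begin{bmatrix} a + Ac \\ b + Bd \end{bmatrix} + \begin{bmatrix} AC \\ BD \end{bmatrix} X. \qquad \square$$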

As an example, let $X, Y$ be independent $\mathcal{N}(0, 1)$ random variables. Then,
$X + Y$ and $X - Y$ are independent.


Indeed, these random variables are jointly Gaussian by Theorem 8.4. Also, they are
uncorrelated since

$$E((X + Y)(X - Y)) - E(X + Y)E(X - Y) = E(X^2 - Y^2) = 0.$$

Hence, they are independent by Theorem 8.3.
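A quick simulation is consistent with this conclusion. The sketch below (sample size and seed are arbitrary choices) estimates the covariance of $U = X + Y$ and $V = X - Y$ and the probability that both are positive, which should be close to $P(U > 0)P(V > 0) = 1/4$ if they are independent. A simulation of this kind illustrates the result but of course does not prove it.

```python
import random

rng = random.Random(1)          # fixed seed so that the run is reproducible
n = 100_000
u, v = [], []
for _ in range(n):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    u.append(x + y)
    v.append(x - y)

mean_u, mean_v = sum(u) / n, sum(v) / n
cov_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n
both_positive = sum(1 for a, b in zip(u, v) if a > 0 and b > 0) / n
print(f"sample cov = {cov_uv:.4f}, P(U > 0, V > 0) ~ {both_positive:.4f}")
```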

We devote the remainder of this section to the derivation of (8.5). We explain in
Theorem B.13 how to calculate the p.d.f. of $AX + b$ from the density of $X$. We recall
the result here for convenience:

$$f_Y(y) = \frac{1}{|A|} f_X(x) \text{ where } Ax + b = y. \quad (8.6)$$

Let us apply (8.6) to the case where $X$ is a vector of $n$ i.i.d. $\mathcal{N}(0, 1)$ random
variables. In this case,

$$f_X(x) = \prod_{j=1}^n f_{X_j}(x_j) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{x_j^2}{2}\right\} = \frac{1}{(2\pi)^{n/2}} \exp\left\{-\frac{\|x\|^2}{2}\right\}.$$

Then, (8.6) gives

$$f_Y(y) = \frac{1}{(2\pi)^{n/2}|A|} \exp\left\{-\frac{\|x\|^2}{2}\right\}$$

where $Ax + \mu_Y = y$. Thus,

$$x = A^{-1}(y - \mu_Y)$$

and

$$\|x\|^2 = \|A^{-1}(y - \mu_Y)\|^2 = (y - \mu_Y)'(A^{-1})'A^{-1}(y - \mu_Y),$$

where we used the facts that $\|z\|^2 = z'z$ and $(Mv)' = v'M'$.
Recall the definition of the covariance matrix:

$$\Sigma_Y = E((Y - E(Y))(Y - E(Y))').$$

Since $Y = AX + \mu_Y$ and $\Sigma_X = I$, the identity matrix, we see that

$$\Sigma_Y = A\Sigma_X A' = AA'.$$

In particular, $\Sigma_Y^{-1} = (A^{-1})'A^{-1}$ and

$$|\Sigma_Y| = |A|^2.$$

Hence, we find that

$$f_Y(y) = \frac{1}{\sqrt{|\Sigma_Y|}(2\pi)^{n/2}} \exp\left\{-\frac{1}{2}(y - \mu_Y)'\Sigma_Y^{-1}(y - \mu_Y)\right\}.$$

This is precisely (8.5).
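Formula (8.5) can be compared numerically with a library implementation. The sketch below assumes that NumPy and SciPy are available; the matrix $A$, the mean and the evaluation point are arbitrary illustrative values, and `jointly_gaussian_pdf` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal


def jointly_gaussian_pdf(y, mu, A):
    """Evaluate (8.5) for Y = A X + mu, assuming A is square and invertible."""
    sigma = A @ A.T
    d = y - mu
    quad = d @ np.linalg.solve(sigma, d)          # (y - mu)' Sigma^{-1} (y - mu)
    norm = np.sqrt(np.linalg.det(sigma)) * (2 * np.pi) ** (len(mu) / 2)
    return np.exp(-0.5 * quad) / norm


A = np.array([[2.0, 0.0], [1.0, 0.5]])            # illustrative values
mu = np.array([1.0, -1.0])
y = np.array([0.3, 0.2])
ours = jointly_gaussian_pdf(y, mu, A)
reference = multivariate_normal(mean=mu, cov=A @ A.T).pdf(y)
print(ours, reference)
```

Solving the linear system instead of forming $\Sigma_Y^{-1}$ explicitly is a common numerical choice; it does not change the value of the quadratic form.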


8.4 Elementary Statistics 


This section explains some basic statistical tests that are at the core of “data science.” 


8.4.1 Zero-Mean? 


Consider the following hypothesis testing problem. The random variable $Y$ is
$\mathcal{N}(\mu, 1)$. We want to decide between two hypotheses:

$$H_0 : \mu = 0 \quad (8.7)$$
$$H_1 : \mu \ne 0. \quad (8.8)$$

We know that $P[|Y| > 2 \mid H_0] \approx 5\%$. That is, if we reject $H_0$ when $|Y| > 2$,
the probability of "false alarm," i.e., of rejecting the hypothesis when it is correct, is
5%. This is what all the tests that we will discuss in this chapter do. However, there
are many tests that achieve the same false alarm probability. For instance, we could
reject $H_0$ when $Y > 1.64$ and the probability of false alarm would also be 5%. Or,
we could reject $H_0$ when $Y$ is in the interval $[1, 1.23]$. The probability of that event
under $H_0$ is also about 5%.
Thus, there are many tests that reject $H_0$ with a probability of false alarm equal
to 5%. Intuitively, we feel that the first one, rejecting $H_0$ when $|Y| > 2$, is
more sensible than the others. This intuition probably comes from the idea that
the alternative hypothesis $H_1 : \mu \ne 0$ appears to be a symmetric assumption about
the likely values of $\mu$. That is, we do not have a reason to believe that under $H_1$ the
mean $\mu$ is more likely to be positive than negative. We just know that it is nonzero.
Given this symmetry, it is intuitively reasonable that the test should be symmetric.
However, there are many symmetric tests! So, we need a more careful justification.
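The three tests mentioned above can be compared numerically. The sketch below computes, for each of them, the probability of rejecting $H_0$ when $Y = \mathcal{N}(\mu, 1)$, first for $\mu = 0$ (the probability of false alarm) and then for $\mu = \pm 2$. The values $\pm 2$ are arbitrary choices made for illustration.

```python
from statistics import NormalDist

N01 = NormalDist()  # N(0, 1)

# Probability of rejecting H0 as a function of the true mean mu.
tests = {
    "|Y| > 2": lambda mu: N01.cdf(-2 - mu) + 1 - N01.cdf(2 - mu),
    "Y > 1.64": lambda mu: 1 - N01.cdf(1.64 - mu),
    "Y in [1, 1.23]": lambda mu: N01.cdf(1.23 - mu) - N01.cdf(1 - mu),
}

for name, reject_prob in tests.items():
    row = [round(reject_prob(mu), 4) for mu in (0.0, 2.0, -2.0)]
    print(f"{name:15s} mu = 0, 2, -2 -> {row}")
```

All three tests have a false alarm probability close to 5%, but only the symmetric test detects $\mu = 2$ and $\mu = -2$ equally well; the one-sided test almost never rejects $H_0$ when $\mu = -2$.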

To justify the test |Y| > 2, we note the following simple result. 


Theorem 8.5 Consider the following hypothesis testing problem: $Y$ is $\mathcal{N}(\mu, 1)$
and

$$H_0 : \mu = 0$$
$$H_1 : \mu \text{ has a symmetric distribution about } 0.$$

Then, the Neyman-Pearson test with probability of false alarm 5% is to reject $H_0$
when $|Y| > 2$.


Proof We know that the Neyman-Pearson test is a likelihood ratio test. Thus, it
suffices to show that the likelihood ratio is increasing in $|Y|$. Assume that the density
of $\mu$ under $H_1$ is $h(x)$. (The same argument goes through if $\mu$ is a mixed random
variable.) Then the pdf $f_1(y)$ of $Y$ under $H_1$ is as follows:

$$f_1(y) = \int h(x) f(y - x) dx,$$

where $f(x) = (1/\sqrt{2\pi}) \exp\{-0.5x^2\}$ is the pdf of a $\mathcal{N}(0, 1)$ random variable.
Consequently, the likelihood ratio $L(y)$ of $Y$ is given by

$$L(y) = \frac{f_1(y)}{f(y)} = \int h(x) \frac{f(y - x)}{f(y)} dx = \int h(x) \exp\left\{-\frac{(y - x)^2}{2} + \frac{y^2}{2}\right\} dx$$
$$= 0.5 \int [h(x) + h(-x)] \exp\{xy\} \exp\left\{-\frac{x^2}{2}\right\} dx$$
$$= 0.5 \int h(x) [\exp\{xy\} + \exp\{-xy\}] \exp\left\{-\frac{x^2}{2}\right\} dx,$$

where the fourth identity comes from $h(x) = 0.5h(x) + 0.5h(-x)$, since $h(x) = h(-x)$,
and the last one from the change of variable $x \to -x$ in the term that contains $h(-x)$.
This expression shows that $L(y) = L(-y)$. Also,

$$L'(y) = 0.5 \int h(x) x [\exp\{xy\} - \exp\{-xy\}] \exp\left\{-\frac{x^2}{2}\right\} dx = \int_0^\infty h(x) x [\exp\{xy\} - \exp\{-xy\}] \exp\left\{-\frac{x^2}{2}\right\} dx,$$

by symmetry of the integrand. For $y > 0$ and $x > 0$, we see that the last integrand
is nonnegative, and positive wherever $h(x) > 0$, so that $L'(y) > 0$ for $y > 0$.

Hence, $L(y)$ is symmetric and increasing in $y > 0$, so that it is an increasing
function of $|y|$, which completes the proof. $\square$


As a simple application, say that you buy 100 light bulbs from brand A and 100
from brand B. You want to test whether they have the same mean lifetime. You
measure the lifetimes $\{X_1^A, \ldots, X_{100}^A\}$ and $\{X_1^B, \ldots, X_{100}^B\}$ of the bulbs of the two
batches and you calculate

$$Y = \frac{(X_1^A + \cdots + X_{100}^A) - (X_1^B + \cdots + X_{100}^B)}{\sigma\sqrt{N}},$$

where $N = 100$ and $\sigma$ is the standard deviation of $X_n^A - X_n^B$ that we assume to be known.

By the CLT, it is reasonable to approximate $Y$ by a $\mathcal{N}(0, 1)$ random variable when
the two brands have the same mean lifetime.
Thus, we reject the hypothesis that the bulbs of the two brands have the same average
lifetime if $|Y| > 2$.

Of course, assuming that $\sigma$ is known is not realistic. The next test is then more
practical.


8.4.2 Unknown Variance 


A practically important variation of the previous example is when the variance $\sigma^2$
is not known. In that case, the Neyman-Pearson test is to decide $H_1$ when

$$\frac{\sqrt{n}\,|\hat{\mu}|}{\hat{\sigma}} > \lambda,$$

where $\hat{\mu}$ is the sample mean of the $Y_m$, as before,

$$\hat{\sigma}^2 = \frac{1}{n - 1}\sum_{m=1}^n (Y_m - \hat{\mu})^2$$

is the sample variance, and $\lambda$ is such that $P(|t_{n-1}| > \lambda) = \beta$.
Here, $t_{n-1}$ is a random variable with a $t$ distribution with $n - 1$ degrees of
freedom. By definition, this means that

$$t_{n-1} = \frac{\mathcal{N}(0, 1)}{\sqrt{\chi^2_{n-1}/(n - 1)}},$$

where $\chi^2_{n-1}$ is the sum of the squares of $n - 1$ i.i.d. $\mathcal{N}(0, 1)$ random variables
that are independent of the numerator.

Fig. 8.3 The projection error

Thus, this $t$-test is very similar to the previous one, except that one
replaces the standard deviation $\sigma$ by its estimate $\hat{\sigma}$ and the threshold $\lambda$ is adjusted
(increased) to reflect the uncertainty in $\sigma$. Statistical packages provide routines to
calculate the appropriate value of $\lambda$. (See scipy.stats.t for Python.)

Figure 8.3 explains the result. The rotation symmetry of $Z$ implies that we can
assume that $V = Z_1$ and that $W = (0, Z_2, \ldots, Z_n)$. As in the previous examples,
one uses the symmetry assumption under $H_1$ to prove that the likelihood ratio is
monotone in $|\hat{\mu}|/\hat{\sigma}$.
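In Python, the threshold of a two-sided test based on the $t$ distribution with $n - 1$ degrees of freedom can be obtained from `scipy.stats.t.ppf`. The sketch below is an illustration on synthetic data: it assumes that NumPy and SciPy are available, it normalizes the sample mean by $\hat{\sigma}/\sqrt{n}$ (the standard $t$ statistic), and the helper name `t_test_zero_mean` and the sample parameters are assumptions.

```python
import numpy as np
from scipy import stats


def t_test_zero_mean(y, beta=0.05):
    """Two-sided test of H0: mean = 0 when the variance is unknown."""
    y = np.asarray(y, dtype=float)
    n = y.size
    mu_hat = y.mean()
    sigma_hat = y.std(ddof=1)                     # square root of the sample variance
    t_stat = np.sqrt(n) * mu_hat / sigma_hat
    lam = stats.t.ppf(1 - beta / 2, df=n - 1)     # P(|t_{n-1}| > lam) = beta
    return bool(abs(t_stat) > lam), float(t_stat), float(lam)


rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=3.0, size=10)  # H0 holds for this synthetic sample
print(t_test_zero_mean(sample))
```

For $n = 10$ and $\beta = 5\%$ the threshold is about 2.26, larger than the value 1.96 that applies when $\sigma$ is known. The statistic agrees with the one returned by `scipy.stats.ttest_1samp(sample, 0.0)`.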

Coming back to our lightbulbs example, what should we do if we have different 
number of bulbs of the two brands? The next test covers that situation. 


8.4.3 Difference of Means 


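You observe $\{X_n, n = 1, \ldots, n_1\}$ and $\{Y_n, n = 1, \ldots, n_2\}$. Assume that these
random variables are all independent and that the $X_n$ are $\mathcal{N}(\mu_1, 1)$ and the $Y_n$
are $\mathcal{N}(\mu_2, 1)$. We want to test whether $\mu_1 = \mu_2$.

Define

$$Z = \frac{1}{\sqrt{n_1^{-1} + n_2^{-1}}}\left(\frac{X_1 + \cdots + X_{n_1}}{n_1} - \frac{Y_1 + \cdots + Y_{n_2}}{n_2}\right).$$

Then $Z = \mathcal{N}(\mu, 1)$ where $\mu = (\mu_1 - \mu_2)/\sqrt{n_1^{-1} + n_2^{-1}}$. Testing $\mu_1 = \mu_2$ is then
equivalent to testing $\mu = 0$. A sensible decision is then to reject the hypothesis that $\mu_1 = \mu_2$ if
$|Z| > 2$.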
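The statistic is straightforward to compute. In the sketch below, the difference of the sample means is divided by its standard deviation $\sqrt{1/n_1 + 1/n_2}$, so that the statistic has unit variance when the samples have unit variance, and a small simulation estimates the probability of false alarm of the rule $|Z| > 2$. The sample sizes, the common mean and the number of trials are arbitrary illustrative choices.

```python
import math
import random


def z_statistic(xs, ys):
    """Normalized difference of sample means, assuming unit-variance samples."""
    n1, n2 = len(xs), len(ys)
    return (sum(xs) / n1 - sum(ys) / n2) / math.sqrt(1 / n1 + 1 / n2)


rng = random.Random(0)                               # fixed seed for reproducibility
trials, n1, n2 = 2000, 30, 50
false_alarms = 0
for _ in range(trials):
    xs = [rng.gauss(1.0, 1.0) for _ in range(n1)]    # mu1 = mu2 = 1: the hypothesis holds
    ys = [rng.gauss(1.0, 1.0) for _ in range(n2)]
    false_alarms += abs(z_statistic(xs, ys)) > 2
print(f"empirical PFA = {false_alarms / trials:.3f}")
```

The empirical value should be close to $P(|\mathcal{N}(0, 1)| > 2) \approx 4.6\%$, up to simulation noise.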

In practice, if n; and n» are not too small, one can invoke the Central Limit 
Theorem to justify the same test even when the random variables are not Gaussian. 
That is typically how this test is used. Also, when the random variables have nonzero 
means and unknown variances, one then renormalizes them by subtracting their 
sample mean and dividing by the sample standard deviation. 
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Needless to say, some care must be taken. It is not difficult to find distributions 
for which this test does not perform well. This fact helps explain why many poorly 
conducted statistical studies regularly contradict one another. Many publications 
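decry this fallacy of the p-value. The p-value is the probability, under $H_0$, of observing a
test statistic at least as extreme as the one obtained; rejecting $H_0$ when the p-value is
below $\beta$ yields a probability of false alarm equal to $\beta$.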


8.4.4 Mean in Hyperplane?


A generalization of the previous example is as follows:

$$H_0 : Y = \mathcal{N}(\mu, \sigma^2 I), \ \mu \in \mathcal{L}$$
$$H_1 : Y = \mathcal{N}(\mu, \sigma^2 I), \ \mu \in \Re^n.$$

Here, $\mathcal{L}$ is an $m$-dimensional subspace in $\Re^n$.

Here is the test that has a probability of false alarm (deciding $H_1$ when $H_0$ is
true) at most $\beta$: Decide

$$\hat{H} = H_1 \text{ if and only if } \frac{1}{\sigma^2}\|Y - \hat{\mu}\|^2 > \beta_{n-m},$$

where

$$\hat{\mu} = \arg\min\{\|Y - x\|^2 : x \in \mathcal{L}\}$$
$$P(\chi^2_{n-m} > \beta_{n-m}) = \beta.$$

In this expression, $\chi^2_{n-m}$ represents a random variable that has a chi-square
distribution with $n - m$ degrees of freedom. This means that it is distributed like
the sum of the squares of $n - m$ random variables that are i.i.d. $\mathcal{N}(0, 1)$.

Fig. 8.4 The projection error

Figure 8.4 shows that

$$Y = \mu + \sigma Z.$$

Now, the distribution of $Z$ is invariant under rotation. Consequently, we can rotate
the axes around $\mu$ so that the first $m$ coordinates span $\mathcal{L}$. The projection $\hat{\mu}$ then
absorbs the first $m$ components of $\sigma Z$, and

$$Y - \hat{\mu} = \sigma(0, \ldots, 0, Z_{m+1}, \ldots, Z_n),$$

so that $\|Y - \hat{\mu}\|^2 = \sigma^2 \sum_{k=m+1}^n Z_k^2 = \sigma^2 \chi^2_{n-m}$, which proves the result.

As in our simple example, this test has a probability of false alarm equal to $\beta$.
Here also, one can show that the test maximizes the probability of correct detection
subject to that bound on the probability of false alarm if under $H_1$ one knows that
$\mu$ has a symmetric pmf around $\mathcal{L}$. This means that $\mu = \gamma_i + v_i$ with probability
$p_i/2$ and $\gamma_i - v_i$ with probability $p_i/2$ where $\gamma_i \in \mathcal{L}$ and $v_i$ is orthogonal to $\mathcal{L}$,
for $i = 1, \ldots, K$. The continuous version of this symmetry should be clear. The
verification of this fact is similar to the simple case we discussed above.
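The test only requires a projection onto the subspace and a chi-square quantile. The sketch below assumes that NumPy and SciPy are available and that the subspace is given by a matrix whose columns span it; the helper name `subspace_mean_test`, this representation and the numerical values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def subspace_mean_test(y, basis, sigma, beta=0.05):
    """Decide H1 when ||Y - mu_hat||^2 / sigma^2 exceeds the chi-square threshold.

    The columns of `basis` (n x m, full column rank) are assumed to span the subspace.
    """
    y = np.asarray(y, dtype=float)
    n, m = basis.shape
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    mu_hat = basis @ coef                          # projection of Y onto the subspace
    statistic = float(np.sum((y - mu_hat) ** 2)) / sigma ** 2
    threshold = float(stats.chi2.ppf(1 - beta, df=n - m))
    return statistic > threshold, statistic, threshold


basis = np.ones((5, 1))                            # subspace of constant vectors (m = 1)
print(subspace_mean_test([3.1, 2.9, 3.0, 3.2, 2.8], basis, sigma=1.0))
print(subspace_mean_test([10, -10, 10, -10, 0], basis, sigma=1.0))
```

With $n = 5$ and $m = 1$ the threshold is the 95% quantile of a chi-square distribution with 4 degrees of freedom, about 9.49. The first vector is nearly constant and is not rejected; the second one is.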


8.4.5 ANOVA 


Our next model is more general and is widely used. In this model, $Y = \mathcal{N}(A\gamma, \sigma^2 I)$.
We would like to test whether $M\gamma = 0$, which is the $H_0$ hypothesis.
Here, $A$ is an $n \times k$ matrix, with $k < n$. Also, $M$ is a $q \times k$ matrix with $q < k$.

The decision is to reject $H_0$ if $F > F_0$ where

$$F = \frac{\|Y - \mu_0\|^2 - \|Y - \mu_1\|^2}{\|Y - \mu_1\|^2} \times \frac{n - k}{q}$$

$$\mu_0 = \arg\min\{\|Y - \mu\|^2 : \mu = A\gamma, M\gamma = 0\}$$

$$\mu_1 = \arg\min\{\|Y - \mu\|^2 : \mu = A\gamma\}$$

$$\beta = P\left(\frac{\chi^2_q/q}{\chi^2_{n-k}/(n - k)} > F_0\right).$$

In the last expression, the ratio of two independent $\chi^2$ random variables, each divided by
its number of degrees of freedom, is said to have an $F$
distribution, in honor of Sir Ronald A. Fisher who introduced this $F$-test in
1920.

This test has a probability of false alarm equal to $\beta$, as Fig. 8.5 shows. This figure
represents the situation under $H_0$, when $Y = \mu_0 + \sigma Z$ and shows that $F$ is the ratio
of two $\chi^2$ random variables, so that it has an $F$ distribution.

As in the previous examples, the optimality of the test in terms of probability of
correct detection requires some symmetry assumptions of $\mu$ under $H_1$.
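The two minimizations are ordinary least squares problems, the first one restricted to the null space of $M$. The sketch below assumes NumPy and SciPy, a matrix $A$ with full column rank and a matrix $M$ with full row rank; the helper name `f_test` and the data are made up for illustration. As an example, it tests whether three groups of measurements have the same mean (one-way ANOVA), in which case $\gamma$ is the vector of group means and $M\gamma = 0$ states that they are equal.

```python
import numpy as np
from scipy import stats
from scipy.linalg import null_space


def f_test(y, A, M, beta=0.05):
    """F-test of H0: M gamma = 0 in the model Y = N(A gamma, sigma^2 I)."""
    y = np.asarray(y, dtype=float)
    n, k = A.shape
    q = M.shape[0]

    def rss(design):
        """Squared distance from y to the column space of `design`."""
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return float(np.sum((y - design @ coef) ** 2))

    rss1 = rss(A)                      # ||Y - mu1||^2
    rss0 = rss(A @ null_space(M))      # ||Y - mu0||^2, with gamma restricted to M gamma = 0
    F = (rss0 - rss1) / rss1 * (n - k) / q
    F0 = float(stats.f.ppf(1 - beta, dfn=q, dfd=n - k))
    return F > F0, F, F0


groups = [[5.1, 4.9, 5.3, 5.0], [5.8, 6.1, 5.9, 6.2], [5.0, 5.2, 4.8, 5.1]]  # made-up data
y = np.concatenate(groups)
A = np.kron(np.eye(3), np.ones((4, 1)))            # column j indicates membership in group j
M = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]) # M gamma = 0 iff the three means are equal
reject, F, F0 = f_test(y, A, M)
print(reject, F, F0)
```

On this example the statistic coincides with the one computed by `scipy.stats.f_oneway`, which is a useful cross-check of the general formula.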


8.5 LDPC Codes 


Low Density Parity Check (LDPC) codes are among the most efficient codes used in 
practice. Gallager invented these codes in his 1960 thesis (Gallager 1963, Fig. 8.6). 
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Fig. 8.5 The F-test. The 
figure shows that F is the 
ratio of two independent 
chi-square random variables 


Fig. 8.6 Robert G. Gallager, 
b. 1931 


These codes are used extensively today, for instance, in satellite video transmissions. 
They are almost optimal for BSC channels and also for many other channels. 

The LDPC codes are as follows. Let $x \in \{0, 1\}^n$ be an $n$-bit string to be
transmitted. One augments this string with the $m$-bit string $y$ where

$$y = Hx. \quad (8.9)$$

Here, $H$ is an $m \times n$ matrix with entries in $\{0, 1\}$, one views $x$ and $y$ as column
vectors and the operations are addition modulo 2. For instance, if

$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}$$

and $x = [01001010]$, then $y = [1110]$. This calculation of the parity check bits $y$
from $x$ is illustrated by the graph, called Tanner graph, shown in Fig. 8.7.

Thus, instead of simply sending the bit string $x$, one sends both $x$ and $y$. The bits
in $y$ are parity check bits. Because of possible transmission errors, the receiver may
get $\tilde{x}$ and $\tilde{y}$ instead of $x$ and $y$. The receiver computes $H\tilde{x}$ and compares the result
with $\tilde{y}$. The idea is that if $\tilde{y} = H\tilde{x}$, then it is likely that $\tilde{x} = x$ and $\tilde{y} = y$. In other
words, it is unlikely that errors would have corrupted $x$ and $y$ in a way that these
vectors would still satisfy the relation $\tilde{y} = H\tilde{x}$. Thus, one expects the scheme to be
good at detecting errors, at least if the matrix $H$ is well chosen.

Fig. 8.7 Tanner graph representation of the LDPC code. The graph shows the
nonzero entries of $H$ (there is an edge if $H(i, j) = 1$), so that $y = Hx$. The receiver gets $\tilde{x}$
and $\tilde{y}$ instead of $x$ and $y$. The nodes $x_j$ are called message nodes and the nodes $y_i$ are
called check nodes

In addition to detecting errors, the LDPC code is used for error correction. If $\tilde{y} \neq H\tilde{x}$, one tries to find the least number of components of $\tilde{x}$ and $\tilde{y}$ that can be changed to satisfy the equations. These would be the most likely transmission errors, if we assume that bit errors are i.i.d. and have a very small probability. However, searching for the possible combinations of components to change is exponentially hard. Instead, one uses iterative algorithms that approximate the solution.

We illustrate a commonly used decoding algorithm, called belief propagation (BP). We assume that each received bit is erroneous with probability $\epsilon \ll 1$ and correct with probability $\bar{\epsilon} = 1 - \epsilon$, independently of the other bits. We also assume that the transmitted bits $x_j$ are equally likely to be 0 or 1. This implies that the parity check bits $y_i$ are also equally likely to be 0 or 1, by symmetry. In this algorithm, the message nodes $x_j$ and the check nodes $y_i$ exchange beliefs along the links of the graph of Fig. 8.7 about the probability that the $x_j$ are equal to 1.

In steps 1, 3, 5, ... of the algorithm, each node $x_j$ sends to each node $y_i$ to which it is attached an estimate of $P(x_j = 1)$. Each node $y_i$ then combines these estimates to send back new estimates to each $x_j$ about $P(x_j = 1)$. Here is the calculation that the y nodes perform. Consider the situation shown in Fig. 8.8 where node $y_1$ gets the estimates $a = P(x_1 = 1)$, $b = P(x_2 = 1)$, $c = P(x_3 = 1)$. Assume also that $\tilde{y}_1 = 1$, from which node $y_1$ calculates $P[y_1 = 1 \mid \tilde{y}_1] = 1 - \epsilon = \bar{\epsilon}$, by Bayes' rule. Since the graph shows that $x_1 + x_2 + x_3 = y_1$, node $y_1$ estimates the probability that $x_1 = 1$ as the probability that an odd number of bits among $\{x_2, x_3, y_1\}$ are equal to one (Fig. 8.9).

To see how to do the calculation, assume that $x_1, \ldots, x_n$ are independent $\{0, 1\}$-valued random variables with $p_j = P(x_j = 1)$. Note that

$$1 - \prod_{j=1}^{n} (1 - 2x_j)$$

is equal to zero if the number of variables that are equal to one among $\{x_1, \ldots, x_n\}$ is even and is equal to two if it is odd. Thus, taking expectation and using independence,

$$2P(\text{odd}) = 1 - \prod_{j=1}^{n} (1 - 2p_j),$$

so that

$$P(\text{odd}) = \frac{1}{2} - \frac{1}{2} \prod_{j=1}^{n} (1 - 2p_j). \quad (8.10)$$

Fig. 8.8 Node $y_1$ gets estimates from x nodes and calculates new estimates

Fig. 8.9 Each node j is equal to one w.p. $p_j$ and to zero otherwise, independently of the other nodes. The probability that an odd number of nodes are one is $P(\text{odd}) = \frac{1}{2} - \frac{1}{2} \prod_{j=1}^{n} (1 - 2p_j)$

Thus, in Fig. 8.8, one finds that

$$P(x_1 = 1) = P(\text{odd among } x_2, x_3, y_1) = \frac{1}{2} - \frac{1}{2}(1 - 2b)(1 - 2c)(1 - 2\bar{\epsilon}). \quad (8.11)$$


The y-nodes in Fig. 8.7 use that procedure to calculate new estimates and send 
them to the x-nodes. 
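As a sanity check of (8.10), the following sketch compares the closed form with a brute-force enumeration over all bit patterns. The function names and the numerical values of `b`, `c`, and `eps` are illustrative assumptions, not part of the text.

```python
from itertools import product
from math import prod

def prob_odd(ps):
    """Closed form (8.10): probability that an odd number of independent bits equal 1."""
    return 0.5 - 0.5 * prod(1 - 2 * p for p in ps)

def prob_odd_brute_force(ps):
    """Same probability, obtained by summing over all bit patterns with odd weight."""
    total = 0.0
    for bits in product([0, 1], repeat=len(ps)):
        if sum(bits) % 2 == 1:
            total += prod(p if bit else 1 - p for p, bit in zip(ps, bits))
    return total

# Check-node estimate as in (8.11), with illustrative values for b, c and eps.
b, c, eps = 0.2, 0.7, 0.05
estimate = prob_odd([b, c, 1 - eps])
print(estimate, prob_odd_brute_force([b, c, 1 - eps]))
```

The brute-force version takes time exponential in the number of bits, which is why the product formula matters for check nodes with many neighbors.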

In steps 2, 4, 6, ... of the algorithm, each node $x_j$ combines the estimates of $P(x_j = 1)$ it gets from $\tilde{x}_j$ and from the y-nodes in the previous steps to calculate new estimates. Each node $x_j$ assumes that the different estimates it got are derived from independent observations. That is, node $x_j$ gets opinions about $P(x_j = 1)$ from independent experts, namely $\tilde{x}_j$ and the $y_i$ to which it is attached in the graph. Node $x_j$ will merge the opinions of these experts to calculate new estimates.

How should one merge the opinions of independent experts? Say that $N$ experts make independent observations $Y_1, \ldots, Y_N$ and provide estimates $p_j = P[X = 1 \mid Y_j]$. Assume that the prior probability is that $P(X = 1) = P(X = 0) = 1/2$. How should one estimate $P[X = 1 \mid Y_1, \ldots, Y_N]$ from $p_1, \ldots, p_N$? Here is the calculation.
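One has

$$\begin{aligned} P[X = 1 \mid Y_1, \ldots, Y_N] &= \frac{P(X = 1, Y_1, \ldots, Y_N)}{P(Y_1, \ldots, Y_N)} \\ &= \frac{P[Y_1, \ldots, Y_N \mid X = 1] P(X = 1)}{\sum_{x = 0, 1} P[Y_1, \ldots, Y_N \mid X = x] P(X = x)} \\ &= \frac{P[Y_1 \mid X = 1] \times \cdots \times P[Y_N \mid X = 1]}{\sum_{x = 0, 1} P[Y_1 \mid X = x] \times \cdots \times P[Y_N \mid X = x]}, \end{aligned} \quad (8.12)$$

where the last step uses the prior $P(X = x) = 1/2$ and the assumption that the observations are independent given $X$. Now,

$$P[Y_n \mid X = x] = \frac{P(X = x, Y_n)}{P(X = x)} = \frac{P[X = x \mid Y_n] P(Y_n)}{1/2}.$$

Thus,

$$P[Y_n \mid X = 0] = 2(1 - p_n) P(Y_n) \text{ and } P[Y_n \mid X = 1] = 2 p_n P(Y_n).$$

Substituting these expressions in (8.12), one finds that

$$P[X = 1 \mid Y_1, \ldots, Y_N] = \frac{p_1 \cdots p_N}{p_1 \cdots p_N + (1 - p_1) \cdots (1 - p_N)}, \quad (8.13)$$

as shown in Fig. 8.10.

Fig. 8.10 Merging the opinion of independent experts about $P(X = 1)$ when the prior is 1/2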
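The fusion rule (8.13) is short enough to state as code. This is a minimal sketch that assumes, as in the text, a prior of 1/2 and independent experts; the function name `fuse` is ours.

```python
from math import prod

def fuse(ps):
    """Merge expert estimates p_j = P[X = 1 | Y_j] as in (8.13), for a prior of 1/2."""
    num = prod(ps)                   # p_1 ... p_N
    den = num + prod(1 - p for p in ps)  # + (1 - p_1) ... (1 - p_N)
    return num / den

print(fuse([0.9]))        # a single expert is taken at face value
print(fuse([0.9, 0.9]))   # two agreeing experts reinforce each other
print(fuse([0.8, 0.2]))   # two opposite, equally confident experts cancel out
```

Experts who report 1/2 carry no information: adding them to the list leaves the merged estimate unchanged.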
Let us apply this rule to the situation shown in Fig. 8.11. In the figure, node $x_1$ gets an estimate $\epsilon$ of $P(x_1 = 1)$ from observing $\tilde{x}_1 = 0$. It also gets estimates $a, b, c$ from the nodes $y_1, y_2, y_3$, and node $x_1$ assumes that these estimates were based on independent observations.

Fig. 8.11 Node $x_1$ gets estimates of $P(x_1 = 1)$ from y nodes and calculates new estimates

To calculate a new estimate that it will send to node $y_1$, node $x_1$ combines the estimates from $\tilde{x}_1$, $y_2$ and $y_3$. This estimate is

$$\frac{\epsilon b c}{\epsilon b c + \bar{\epsilon} \bar{b} \bar{c}}, \quad (8.14)$$

where $\bar{b} = 1 - b$ and $\bar{c} = 1 - c$. In the next step, node $x_1$ will send that estimate to node $y_1$. It also calculates estimates for nodes $y_2$ and $y_3$.

Summing up, the algorithm is as follows. At each odd step, node $x_j$ sends $X(i, j)$ to each node $y_i$. At each even step, node $y_i$ sends $Y(i, j)$ to each node $x_j$. One has

$$Y(i, j) = \frac{1}{2} - \frac{1}{2}(1 - 2\epsilon)(1 - 2\tilde{y}_i) \prod_{s \in A(i, j)} (1 - 2X(i, s)), \quad (8.15)$$

where $A(i, j) = \{s \neq j \mid H(i, s) = 1\}$ and

$$X(i, j) = \frac{N(i, j)}{N(i, j) + D(i, j)}, \quad (8.16)$$

where

$$N(i, j) = P[x_j = 1 \mid \tilde{x}_j] \prod_{v \neq i : H(v, j) = 1} Y(v, j)$$

and

$$D(i, j) = P[x_j = 0 \mid \tilde{x}_j] \prod_{v \neq i : H(v, j) = 1} (1 - Y(v, j))$$

with

$$P[x_j = 1 \mid \tilde{x}_j] = \epsilon + (1 - 2\epsilon)\tilde{x}_j.$$

Also, node $x_j$ can update its probability of being 1 by merging the opinions of the experts as

$$X(j) = \frac{N(j)}{N(j) + D(j)}, \quad (8.17)$$

where

$$N(j) = P[x_j = 1 \mid \tilde{x}_j] \prod_{v : H(v, j) = 1} Y(v, j)$$

and

$$D(j) = P[x_j = 0 \mid \tilde{x}_j] \prod_{v : H(v, j) = 1} (1 - Y(v, j)).$$

After enough iterations, one makes the detection decisions $\hat{x}_j = 1\{X(j) > 0.5\}$.

Fig. 8.12 Belief propagation applied to the example of Fig. 8.7. The horizontal axis is the step of the algorithm. The vertical axis is the best guess for each $x(i)$ at that step. For clarity, we separated the guesses by 0.1. The final detection is [0, 1, 0, 0, 1, 0, 1, 0], which is intuitively the best guess

Figure 8.12 shows the evolution over time of the estimated probabilities that the $x_j$ are equal to one. Our code is a direct implementation of the formulas in this section. More sophisticated implementations use sums of logarithms instead of products.

Simulations, and a deep theory, show that this algorithm performs well if the 
graph does not have small cycles. In such a case, the assumption that the estimates 
are obtained from independent observations is almost correct. 
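For concreteness, here is a minimal Python sketch of belief propagation that follows (8.15)-(8.17) literally, with plain products rather than logarithms. It is not the code used to produce Fig. 8.12; the function name `bp_decode`, the value of `eps`, the number of rounds, and the choice of which bit to corrupt are assumptions made for illustration.

```python
def bp_decode(H, x_recv, y_recv, eps, rounds=20):
    """Belief propagation on the Tanner graph of H.

    x_recv, y_recv: received message and parity bits (lists of 0/1).
    Returns (decisions, beliefs), where beliefs[j] estimates P(x_j = 1).
    """
    m, n = len(H), len(H[0])
    # P[x_j = 1 | received bit] = eps + (1 - 2 eps) * received bit.
    prior = [eps + (1 - 2 * eps) * bit for bit in x_recv]
    X = [[prior[j] for j in range(n)] for _ in range(m)]  # messages x_j -> y_i
    Y = [[0.5] * n for _ in range(m)]                     # messages y_i -> x_j
    for _ in range(rounds):
        for i in range(m):        # check nodes, as in (8.15)
            for j in range(n):
                if H[i][j]:
                    p = (1 - 2 * eps) * (1 - 2 * y_recv[i])
                    for s in range(n):
                        if s != j and H[i][s]:
                            p *= 1 - 2 * X[i][s]
                    Y[i][j] = 0.5 - 0.5 * p
        for i in range(m):        # message nodes, as in (8.16)
            for j in range(n):
                if H[i][j]:
                    num, den = prior[j], 1 - prior[j]
                    for v in range(m):
                        if v != i and H[v][j]:
                            num *= Y[v][j]
                            den *= 1 - Y[v][j]
                    X[i][j] = num / (num + den)
    beliefs = []
    for j in range(n):            # final merge of all experts, as in (8.17)
        num, den = prior[j], 1 - prior[j]
        for v in range(m):
            if H[v][j]:
                num *= Y[v][j]
                den *= 1 - Y[v][j]
        beliefs.append(num / (num + den))
    return [1 if b > 0.5 else 0 for b in beliefs], beliefs

H = [
    [1, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 1, 1],
]
x = [0, 1, 0, 0, 1, 0, 1, 0]
y = [1, 1, 1, 0]

# Error-free reception: the decoder returns the transmitted string.
decoded, _ = bp_decode(H, x, y, eps=0.05)
print(decoded)

# Hypothetical corrupted reception: the first message bit is flipped.
x_bad = [1 - x[0]] + x[1:]
decoded_bad, beliefs_bad = bp_decode(H, x_bad, y, eps=0.05)
print(decoded_bad, [round(b, 3) for b in beliefs_bad])
```

Each pass of the outer loop performs one check-node step followed by one message-node step. For long codes, multiplying many probabilities can underflow, which is one reason practical decoders work with log-likelihood ratios.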


8.6 Summary


• LDPC Codes;
• Jointly Gaussian Random Variables, independent if uncorrelated;
• Proof of Neyman-Pearson Theorem;
• Testing properties of the mean.

8.6.1 Key Equations and Formulas


LDPC: $y = Hx$ (8.9)
P(odd): $P(\sum_j X_j = 1) = 0.5 - 0.5 \prod_j (1 - 2p_j)$ (8.10)
Fusion of Experts: $P[X = 1 \mid Y_1, \ldots, Y_n] = \prod_j p_j / (\prod_j p_j + \prod_j (1 - p_j))$ (8.13)
Jointly Gaussian: $N(\mu, \Sigma) \Leftrightarrow f_X = \ldots$ (8.4)
If X, Y are J.G., then $X \perp Y \Rightarrow$ X, Y are independent: Theorem 8.3


8.7 References 


The book (Richardson and Urbanke 2008) is a comprehensive reference on LDPC 
codes and iterative decoding techniques. 


8.8 Problems 


Problem 8.1 Construct two Gaussian random variables that are not jointly Gaussian. Hint: Let $X =_D N(0, 1)$ and $Z$ be independent random variables with $P(Z = 1) = P(Z = -1) = 1/2$. Define $Y = XZ$. Show that $X$ and $Y$ meet the requirements of the problem.


Problem 8.2 Assume that $X =_D (Y + Z)/\sqrt{2}$ where $Y$ and $Z$ are independent and distributed like $X$. Show that $X =_D N(0, \sigma^2)$ for some $\sigma^2 > 0$. Hint: First show that $E(X) = 0$. Second, show by induction that $X =_D (V_1 + \cdots + V_m)/\sqrt{m}$ for $m = 2^n$, where the $V_i$ are i.i.d. and distributed like $X$. Conclude using the CLT.


Problem 8.3 Consider Problem 7.8 but assume now that $Z =_D N(0, \Sigma)$ where

$$\Sigma = \begin{bmatrix} 0.2 & 0.1 \\ 0.1 & 0.3 \end{bmatrix}.$$

The symbols are equally likely and the receiver uses the MLE. Simulate the system 
using Python to estimate the fraction of errors. 

Application: Estimation, Tracking 
Topics: LLSE, MMSE, Kalman Filter 


9.1 Examples 


A GPS receiver uses the signals it gets from satellites to estimate its location 
(Fig. 9.1). Temperature and pressure sensors provide signals that a computer uses 
to estimate the state of a chemical reactor. 

A radar measures electromagnetic waves that an object reflects and uses the 
measurements to estimate the position of that object (Fig. 9.2). 

Similarly, your car's control computer estimates the state of the car from 
measurements it gets from various sensors (Fig. 9.3). 


9.2 Estimation Problem 


The basic estimation problem can be formulated as follows. There is a pair of 
continuous random variables (X, Y). The problem is to estimate X from the 
observed value of Y. 

This problem admits a few different formulations: 


* Known Distribution: We know the joint distribution of (X, Y); 
* Off-Line: We observe a set of sample values of (X, Y); 
* On-Line: We observe successive values of samples of (X, Y); 
Fig. 9.1 Estimating the location of a device from satellite signals

Fig. 9.2 Estimating the position of an object from radar signals

Fig. 9.3 Estimating the state of a vehicle from sensor signals


The objective is to choose the inference function $g(\cdot)$ to minimize the expected error $C(g)$ where

$$C(g) = E(c(X, g(Y))).$$

In this expression, $c(X, \hat{X})$ is the cost of guessing $\hat{X}$ when the actual value is $X$. A standard example is

$$c(X, \hat{X}) = |X - \hat{X}|^2.$$

We will also study the case when $X \in \Re^d$ for $d > 1$. In such a situation, one uses $c(X, \hat{X}) = \|X - \hat{X}\|^2$. If the function $g(\cdot)$ can be arbitrary, the function that minimizes $C(g)$ is the Minimum Mean Squares Estimate (MMSE) of $X$ given $Y$. If the function $g(\cdot)$ is restricted to be linear, i.e., of the form $a + bY$, the linear function that minimizes $C(g)$ is the Linear Least Squares Estimate (LLSE) of $X$ given $Y$. One may also restrict $g(\cdot)$ to be a polynomial of a given degree. For instance, one may define the Quadratic Least Squares Estimate (QLSE) of $X$ given $Y$. See Fig. 9.4.

Fig. 9.4 Least squares estimates of X given Y: LLSE is linear, QLSE is quadratic, and MMSE can be an arbitrary function


As we will see, a general method for the off-line inference problem is to choose a parametric class of functions $\{g_w, w \in \Re^d\}$ and to then minimize the empirical error

$$\sum_{k=1}^{K} c(X_k, g_w(Y_k))$$

over the parameters $w$. Here, the $(X_k, Y_k)$ are the observed samples. The parametric function could be linear, polynomial, or a neural network.

For the on-line problem, one also chooses a similar parametric family of functions and one uses a stochastic gradient descent algorithm of the form

$$w(k + 1) = w(k) - \gamma \nabla_w c(X_{k+1}, g_w(Y_{k+1})),$$

where $\nabla_w$ is the gradient with respect to $w$ and $\gamma > 0$ is a small step size. The justification for this approach is that, since $\gamma$ is small, by the SLLN, the update over $K$ successive steps tends to be in the direction of

$$-\gamma \sum_{i=k}^{k+K-1} \nabla_w c(X_{i+1}, g_w(Y_{i+1})) \approx -\gamma K \nabla E(c(X_k, g_w(Y_k))) = -\gamma K \nabla C(g_w),$$

which would correspond to a gradient algorithm to minimize $C(g_w)$.
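As an illustration of the on-line update, the sketch below runs stochastic gradient descent for the linear family $g_w(Y) = a + bY$ with the squared error cost. The data model (a line plus Gaussian noise), the step size, and the names `sgd_linear` and `sample` are assumptions chosen for the example.

```python
import random

def sgd_linear(samples, gamma=0.01):
    """On-line fit of g_w(Y) = a + b*Y by stochastic gradient descent.

    The cost is c(X, g_w(Y)) = (X - a - b*Y)**2, whose gradient with
    respect to w = (a, b) is -2 * (X - a - b*Y) * (1, Y).
    """
    a, b = 0.0, 0.0
    for x, y in samples:
        err = x - (a + b * y)
        a += gamma * 2 * err       # w(k+1) = w(k) - gamma * gradient
        b += gamma * 2 * err * y
    return a, b

rng = random.Random(0)

def sample():
    # Hypothetical model: X = 1 + 2Y + noise, with Y uniform in [-1, 1].
    y = rng.uniform(-1, 1)
    return 1.0 + 2.0 * y + rng.gauss(0, 0.1), y

a, b = sgd_linear(sample() for _ in range(20000))
print(a, b)  # close to the true parameters (1, 2)
```

A smaller step size reduces the residual fluctuation of the parameters around their limit, at the price of slower convergence.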


9.3 Linear Least Squares Estimates 


In this section, we study the linear least squares estimates. Recall the setup that we explained in the previous section. There is a pair $(X, Y)$ of random variables with some joint distribution and the problem is to find the function $g(Y) = a + bY$ that minimizes

$$C(g) = E(|X - g(Y)|^2).$$

One considers the cases where the distribution is known, or a set of samples has been observed, or one observes one sample at a time.

Assume that the joint distribution of $(X, Y)$ is known. This means that we know the joint cumulative distribution function (j.c.d.f.) $F_{X,Y}(x, y)$.¹

We are looking for the function $g(Y) = a + bY$ that minimizes

$$C(g) = E(|X - g(Y)|^2) = E(|X - a - bY|^2).$$

We denote this function by $L[X|Y]$. Thus, we have the following definition.


Definition 9.1 (Linear Least Squares Estimate (LLSE)) The LLSE of $X$ given $Y$, denoted by $L[X|Y]$, is the linear function $a + bY$ that minimizes

$$E(|X - a - bY|^2).$$

Note that

$$\begin{aligned} C(g) &= E(X^2 + a^2 + b^2 Y^2 - 2aX - 2bXY + 2abY) \\ &= E(X^2) + a^2 + b^2 E(Y^2) - 2aE(X) - 2bE(XY) + 2abE(Y). \end{aligned}$$

To find the values of $a$ and $b$ that minimize that expression, we set to zero the partial derivatives with respect to $a$ and $b$. This gives the following two equations:

$$0 = 2a - 2E(X) + 2bE(Y) \quad (9.1)$$
$$0 = 2bE(Y^2) - 2E(XY) + 2aE(Y). \quad (9.2)$$

Solving these equations for $a$ and $b$, we find that

$$L[X|Y] = a + bY = E(X) + \frac{\text{cov}(X, Y)}{\text{var}(Y)}(Y - E(Y)),$$

where we used the identities

$$\text{cov}(X, Y) = E(XY) - E(X)E(Y) \text{ and } \text{var}(Y) = E(Y^2) - E(Y)^2.$$

We summarize this result as a theorem.


¹ See Appendix B.


Theorem 9.1 (Linear Least Squares Estimate) One has

$$L[X|Y] = E(X) + \frac{\text{cov}(X, Y)}{\text{var}(Y)}(Y - E(Y)). \quad (9.3)$$
As a first example, assume that

$$Y = \alpha X + Z,$$

where $X$ and $Z$ are zero-mean and independent. In this case, we find²

$$\begin{aligned} \text{cov}(X, Y) &= E(XY) - E(X)E(Y) = E(X(\alpha X + Z)) = \alpha E(X^2) \\ \text{var}(Y) &= \alpha^2 \text{var}(X) + \text{var}(Z) = \alpha^2 E(X^2) + E(Z^2). \end{aligned}$$

Hence,

$$L[X|Y] = \frac{\alpha E(X^2)}{\alpha^2 E(X^2) + E(Z^2)} Y = \frac{\alpha^{-1} Y}{1 + \text{SNR}^{-1}}, \quad (9.4)$$

where

$$\text{SNR} := \frac{\alpha^2 E(X^2)}{E(Z^2)}$$

is the signal-to-noise ratio, i.e., the ratio of the power $E(\alpha^2 X^2)$ of the signal in $Y$ divided by the power $E(Z^2)$ of the noise. Note that if SNR is small, then $L[X|Y]$ is close to zero, which is the best guess about $X$ if one does not make any observation. Also, if SNR is very large, then $L[X|Y] \approx \alpha^{-1} Y$, which is the correct guess if $Z = 0$.

As a second example, assume that

$$X = \alpha Y + \beta Y^2, \quad (9.5)$$

where³ $Y =_D U[0, 1]$. Then,

$$\begin{aligned} E(X) &= \alpha E(Y) + \beta E(Y^2) = \alpha/2 + \beta/3; \\ \text{cov}(X, Y) &= E(XY) - E(X)E(Y) \\ &= E(\alpha Y^2 + \beta Y^3) - (\alpha/2 + \beta/3)(1/2) \\ &= \alpha/3 + \beta/4 - \alpha/4 - \beta/6 \\ &= (\alpha + \beta)/12 \\ \text{var}(Y) &= E(Y^2) - E(Y)^2 = 1/3 - (1/2)^2 = 1/12. \end{aligned}$$

Hence,

$$L[X|Y] = \alpha/2 + \beta/3 + (\alpha + \beta)(Y - 1/2) = -\beta/6 + (\alpha + \beta)Y.$$

² Indeed, $E(XZ) = E(X)E(Z) = 0$, by independence.
³ Thus, $E(Y^k) = (1 + k)^{-1}$.

Fig. 9.5 The figure shows $L[\alpha Y + \beta Y^2 | Y]$ when $Y =_D U[0, 1]$


This estimate is sketched in Fig. 9.5. Obviously, if one observes Y , one can compute 
X. However, recall that L[X|Y] is restricted to being a linear function of Y. 
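The coefficients of the second example can be checked with exact rational arithmetic. In this sketch, the helper names `llse` and `moment` and the values chosen for the two coefficients are ours; the only facts used are the formula of Theorem 9.1 and the moments of a uniform random variable on [0, 1].

```python
from fractions import Fraction

def llse(EX, EY, EXY, EY2):
    """Return (a, b) such that L[X|Y] = a + b*Y, using Theorem 9.1."""
    cov = EXY - EX * EY
    var = EY2 - EY ** 2
    b = cov / var
    return EX - b * EY, b

def moment(k):
    """E(Y^k) = 1/(k + 1) when Y is uniform on [0, 1]."""
    return Fraction(1, k + 1)

# X = alpha*Y + beta*Y^2 with illustrative coefficients.
alpha, beta = Fraction(1), Fraction(2)
EX = alpha * moment(1) + beta * moment(2)
EXY = alpha * moment(2) + beta * moment(3)

a, b = llse(EX, moment(1), EXY, moment(2))
print(a, b)  # -1/3 3, i.e. -beta/6 and alpha + beta
```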


9.3.1 Projection 


There is an insightful interpretation of $L[X|Y]$ as a projection that also helps understand more complex estimates. This interpretation is that $L[X|Y]$ is the projection of $X$ onto the set $\mathcal{L}(Y)$ of linear functions of $Y$.

This interpretation is sketched in Fig. 9.6. In that figure, random variables are represented by points and $\mathcal{L}(Y)$ is shown as a plane since the linear combination of points in that set is again in the set. In the figure, the square of the length of a vector from a random variable $V$ to another random variable $W$ is $E(|V - W|^2)$. Also, we say that two vectors $V$ and $W$ are orthogonal if $E(VW) = 0$. Thus, $L[X|Y] = a + bY$ is the projection of $X$ onto $\mathcal{L}(Y)$ if $X - L[X|Y]$ is orthogonal to every linear function of $Y$, i.e., if

$$E((X - a - bY)(c + dY)) = 0, \forall c, d \in \Re.$$

Equivalently,

$$E(X) = a + bE(Y) \text{ and } E((X - a - bY)Y) = 0. \quad (9.6)$$

These two equations are the same as (9.1)-(9.2). We call the identities (9.6) the projection property.

Fig. 9.6 $L[X|Y]$ is the projection of $X$ onto $\mathcal{L}(Y)$

Fig. 9.7 Example of projection
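Figure 9.7 illustrates the projection when

$$X =_D N(0, 1) \text{ and } Y = X + Z \text{ where } Z =_D N(0, \sigma^2).$$

In this figure, the length of $Z$ is equal to $\sqrt{E(Z^2)} = \sigma$, the length of $X$ is $\sqrt{E(X^2)} = 1$ and the vectors $X$ and $Z$ are orthogonal because $E(XZ) = 0$.

We see that the triangles $0\hat{X}X$ and $0XY$ are similar. Hence,

$$\frac{\|\hat{X}\|}{\|X\|} = \frac{\|X\|}{\|Y\|},$$

so that

$$\frac{\|\hat{X}\|}{1} = \frac{1}{\sqrt{1 + \sigma^2}} = \frac{\|Y\|}{1 + \sigma^2},$$

since $\|Y\| = \sqrt{1 + \sigma^2}$. This shows that

$$\hat{X} = \frac{1}{1 + \sigma^2} Y.$$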
To see why the projection property implies that $L[X|Y]$ is the closest point to $X$ in $\mathcal{L}(Y)$, as suggested by Fig. 9.6, we verify that

$$E(|X - L[X|Y]|^2) \leq E(|X - h(Y)|^2),$$

for any given $h(Y) = c + dY$. The idea of the proof is to verify Pythagoras' identity on the right triangle with vertices $X$, $L[X|Y]$ and $h(Y)$. We have

$$\begin{aligned} E(|X - h(Y)|^2) &= E(|X - L[X|Y] + L[X|Y] - h(Y)|^2) \\ &= E(|X - L[X|Y]|^2) + E(|L[X|Y] - h(Y)|^2) \\ &\quad + 2E((X - L[X|Y])(L[X|Y] - h(Y))). \end{aligned}$$

Now, the projection property (9.6) implies that the last term in the above expression is equal to zero. Indeed, $L[X|Y] - h(Y)$ is a linear function of $Y$. It follows that

$$\begin{aligned} E(|X - h(Y)|^2) &= E(|X - L[X|Y]|^2) + E(|L[X|Y] - h(Y)|^2) \\ &\geq E(|X - L[X|Y]|^2), \end{aligned}$$

as was to be proved. 


9.4 Linear Regression 


Assume now that, instead of knowing the joint distribution of $(X, Y)$, we observe $K$ i.i.d. samples $(X_1, Y_1), \ldots, (X_K, Y_K)$ of these random variables. Our goal is still to construct a function $g(Y) = a + bY$ so that

$$E(|X - a - bY|^2)$$

is minimized. We do this by choosing $a$ and $b$ to minimize the sum of the squares of the errors based on the samples. That is, we choose $a$ and $b$ to minimize

$$\sum_{k=1}^{K} |X_k - a - bY_k|^2.$$

To do this, we set to zero the derivatives of this sum with respect to $a$ and $b$. Algebra shows that the resulting values of $a$ and $b$ are such that

$$a + bY = E_K(X) + \frac{\text{cov}_K(X, Y)}{\text{var}_K(Y)}(Y - E_K(Y)), \quad (9.7)$$

where we defined 
$$E_K(X) = \frac{1}{K} \sum_{k=1}^{K} X_k, \quad E_K(Y) = \frac{1}{K} \sum_{k=1}^{K} Y_k,$$

$$\text{cov}_K(X, Y) = \frac{1}{K} \sum_{k=1}^{K} X_k Y_k - E_K(X) E_K(Y),$$

$$\text{var}_K(Y) = \frac{1}{K} \sum_{k=1}^{K} Y_k^2 - E_K(Y)^2.$$

Fig. 9.8 The linear regression of X over Y

That is, the expression (9.7) is the same as (9.3), except that the expectation is 
replaced by the sample mean. The expression (9.7) is called the linear regression of 
X over Y. It is shown in Fig. 9.8. 
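The linear regression (9.7) translates directly into code: replace each expectation by a sample mean. The function name `linear_regression` and the toy data below are illustrative assumptions.

```python
def linear_regression(xs, ys):
    """Return (a, b) of the linear regression a + b*Y of X over Y, as in (9.7)."""
    K = len(xs)
    mean_x = sum(xs) / K                                  # E_K(X)
    mean_y = sum(ys) / K                                  # E_K(Y)
    cov = sum(x * y for x, y in zip(xs, ys)) / K - mean_x * mean_y
    var = sum(y * y for y in ys) / K - mean_y ** 2
    b = cov / var
    return mean_x - b * mean_y, b

# Noise-free toy data on the line X = 1 + 2Y.
ys = [0.0, 1.0, 2.0, 3.0, 4.0]
xs = [1.0 + 2.0 * y for y in ys]
print(linear_regression(xs, ys))  # (1.0, 2.0)
```

The function assumes that the observed values of Y are not all equal, so that the sample variance is positive.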

One has the following result. 


Theorem 9.2 (Linear Regression Converges to LLSE) As the number of samples 
increases, the linear regression approaches the LLSE. 


Proof As $K \to \infty$, one has, by the Strong Law of Large Numbers,

$$E_K(X) \to E(X), \quad E_K(Y) \to E(Y),$$
$$\text{cov}_K(X, Y) \to \text{cov}(X, Y), \quad \text{var}_K(Y) \to \text{var}(Y).$$

Combined with the expressions for the linear regression and the LLSE, these properties imply the result. □


Formula (9.3) and the linear regression provide an intuitive meaning of the 
covariance cov(X, Y). If this covariance is zero, then L[X|Y] does not depend 
on Y. If it is positive (negative), it increases (decreases, respectively) with Y. 
Thus, cov(X, Y) measures a form of dependency in terms of linear regression. For instance, the random variables in Fig. 9.9 are uncorrelated since $L[X|Y]$ does not depend on $Y$.

Fig. 9.9 The random variables X and Y are uncorrelated. Note that they are not independent


9.5 A Note on Overfitting


In the previous section, we examined the problem of finding the linear function $a + bY$ that best approximates $X$, in the mean squared error sense. We could develop the corresponding theory for quadratic approximations $a + bY + cY^2$, or for polynomial approximations of a given degree. The ideas would be the same and one would have a similar projection interpretation.

In principle, a higher degree polynomial approximates X better than a lower 
degree one since there are more such polynomials. The question of fitting the 
parameters with a given number of observations is more complex. 

Assume you observe $N$ data points $\{(X_n, Y_n), n = 1, \ldots, N\}$. If the values $Y_n$ are different, one can define the function $g(\cdot)$ by $g(Y_n) = X_n$ for $n = 1, \ldots, N$. This function achieves a zero mean squared error on the samples. What is then the point of looking for a linear function, or a quadratic, or some polynomial of a given degree? Why not simply define $g(Y_n) = X_n$?

Remember that the goal of the estimation is to discover a function $g(\cdot)$ that is likely to work well for data points we have not yet observed. For instance, we hope that $E(c(X_{N+1}, g(Y_{N+1})))$ is small, where $(X_{N+1}, Y_{N+1})$ has the same distribution as the samples $(X_n, Y_n)$ we have observed for $n = 1, \ldots, N$.

If we define $g(Y_n) = X_n$, this does not tell us how to calculate $g(Y_{N+1})$ for a value $Y_{N+1}$ we have not observed. However, if we construct a polynomial $g(\cdot)$ of a given degree based on the $N$ samples, then we can calculate $g(Y_{N+1})$. The key observation is that a higher degree polynomial may not be a better estimate because it tends to fit noise instead of important statistics.

As a simple illustration of overfitting, say that we observe $(X_1, Y_1)$ and $Y_2$. We want to guess $X_2$. Assume that the samples $X_n, Y_n$ are all independent and $U[-1, 1]$. If we guess $\hat{X}_2 = 0$, the mean squared error is $E((X_2 - \hat{X}_2)^2) = E(X_2^2) = 1/3$. If we use the guess $\hat{X}_2 = X_1$ based on the observations, then $E((X_2 - \hat{X}_2)^2) = E((X_2 - X_1)^2) = 2/3$. Hence, ignoring the observation is better than taking it into account.
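A short Monte Carlo experiment makes the two error values tangible. The sample size and the seed are arbitrary choices for this sketch.

```python
import random

rng = random.Random(1)
N = 200_000

err_ignore = 0.0  # squared error of the guess 0, which ignores the data
err_use = 0.0     # squared error of the guess X_1, which uses the observation
for _ in range(N):
    x1 = rng.uniform(-1, 1)
    x2 = rng.uniform(-1, 1)
    err_ignore += (x2 - 0.0) ** 2
    err_use += (x2 - x1) ** 2

mse_ignore = err_ignore / N
mse_use = err_use / N
print(mse_ignore, mse_use)  # close to 1/3 and 2/3, respectively
```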

The practical question is how to detect overfitting. For instance, how does one 
determine whether a linear regression is better than a quadratic regression? A simple 
test is as follows. Say you observed $N$ samples $\{(X_n, Y_n), n = 1, \ldots, N\}$. You remove sample $n$ and compute a linear regression using the $N - 1$ other samples. You use that regression to calculate the estimate $\hat{X}_n$ of $X_n$ based on $Y_n$. You then compute the squared error $(X_n - \hat{X}_n)^2$. You repeat that procedure for $n = 1, \ldots, N$ and add up the squared errors. You then use the same procedure for a quadratic regression and you compare.
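This leave-one-out test can be sketched in plain Python. The polynomial fit below solves the normal equations by Gaussian elimination, which is adequate for low degrees; the function names `polyfit` and `loo_error` and the toy data are assumptions made for the illustration.

```python
def polyfit(ys, xs, degree):
    """Least squares coefficients c[0..degree] of x ~ sum_k c[k] * y**k."""
    n = degree + 1
    # Normal equations A c = r, with A[p][q] = sum y^(p+q) and r[p] = sum x y^p.
    A = [[sum(y ** (p + q) for y in ys) for q in range(n)] for p in range(n)]
    r = [sum(x * y ** p for x, y in zip(xs, ys)) for p in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda i: abs(A[i][col]))
        A[col], A[piv] = A[piv], A[col]
        r[col], r[piv] = r[piv], r[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= f * A[col][k]
            r[row] -= f * r[col]
    c = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        c[i] = (r[i] - sum(A[i][k] * c[k] for k in range(i + 1, n))) / A[i][i]
    return c

def loo_error(ys, xs, degree):
    """Sum over n of the squared error when sample n is left out of the fit."""
    total = 0.0
    for n in range(len(ys)):
        c = polyfit(ys[:n] + ys[n + 1:], xs[:n] + xs[n + 1:], degree)
        prediction = sum(ck * ys[n] ** k for k, ck in enumerate(c))
        total += (xs[n] - prediction) ** 2
    return total

# Toy data where X is exactly quadratic in Y.
ys = [i / 10 for i in range(-10, 11)]
xs = [y * y for y in ys]
print(loo_error(ys, xs, 1), loo_error(ys, xs, 2))
```

On these data the quadratic regression has a much smaller leave-one-out error than the linear one. With noisy data and few samples, the comparison can go the other way, which is the overfitting effect discussed above.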


9.6 MMSE 


For now, assume that we know the joint distribution of $(X, Y)$ and consider the problem of finding the function $g(Y)$ that minimizes

$$E(|X - g(Y)|^2),$$

over all the possible functions $g(\cdot)$. The best function is called the MMSE of $X$ given $Y$. We have the following theorem:


Theorem 9.3 (The MMSE Is the Conditional Expectation) The MMSE of X 
given Y is given by 


g(Y) = E[X|Y], 


where E[X|Y] is the conditional expectation of X given Y. 


Before proving this result, we need to define the conditional expectation. 


Definition 9.2 (Conditional Expectation) The conditional expectation of $X$ given $Y$ is defined by

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}[x|y] dx,$$

where

$$f_{X|Y}[x|y] := \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

is the conditional density of $X$ given $Y$.


Figure 9.10 illustrates the conditional expectation. That figure assumes that the pair $(X, Y)$ is picked uniformly in the shaded area. Thus, if one observes that $Y \in (y, y + dy)$, the point $X$ is uniformly distributed along the segment that cuts the shaded area at $Y = y$. Accordingly, the average value of $X$ is the mid-point of that segment, as indicated in the figure. The dashed red line shows how that mean value depends on $Y$ and it defines $E[X|Y]$.

Fig. 9.10 The conditional expectation $E[X|Y]$ when the pair $(X, Y)$ is picked uniformly in the shaded area

The following result is a direct consequence of the definition. 


Lemma 9.4 (Orthogonality Property of MMSE)


(a) For any function $\phi(\cdot)$, one has

$$E((X - E[X|Y])\phi(Y)) = 0. \quad (9.8)$$

(b) Moreover, if the function $g(Y)$ is such that

$$E((X - g(Y))\phi(Y)) = 0, \forall \phi(\cdot), \quad (9.9)$$

then $g(Y) = E[X|Y]$.


Proof


(a) To verify (9.8) note that

$$\begin{aligned} E(E[X|Y]\phi(Y)) &= \int_{-\infty}^{\infty} E[X \mid Y = y] \phi(y) f_Y(y) dy \\ &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \frac{f_{X,Y}(x, y)}{f_Y(y)} dx \, \phi(y) f_Y(y) dy \\ &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \phi(y) f_{X,Y}(x, y) dx dy \\ &= E(X\phi(Y)), \end{aligned}$$

which proves (9.8).

(b) To prove the second part of the lemma, note that

$$E(|g(Y) - E[X|Y]|^2) = E((g(Y) - E[X|Y])\{(g(Y) - X) - (E[X|Y] - X)\}) = 0,$$

because of (9.8) and (9.9) with $\phi(Y) = g(Y) - E[X|Y]$.

Note that the second part of the lemma simply says that the projection property characterizes uniquely the conditional expectation. In other words, there is only one projection of $X$ onto $\mathcal{G}(Y)$. □

Fig. 9.11 The conditional expectation $E[X|Y]$ as the projection of $X$ on the set $\mathcal{G}(Y) = \{g(Y) \mid g(\cdot) \text{ is a function}\}$ of functions of $Y$


We can now prove the theorem. 


Proof of Theorem 9.3 The identity (9.8) is the projection property. It states that $X - E[X|Y]$ is orthogonal to the set $\mathcal{G}(Y)$ of functions of $Y$, as shown in Fig. 9.11.

In particular, it is orthogonal to $h(Y) - E[X|Y]$. As in the case of the LLSE, this projection property implies that

$$E(|X - h(Y)|^2) \geq E(|X - E[X|Y]|^2),$$

for any function $h(\cdot)$. This implies that $E[X|Y]$ is indeed the MMSE of $X$ given $Y$. □


From the definition, we see how to calculate E[X|Y] from the conditional density 
of X given Y. However, in many cases one can calculate E[X|Y] more simply. One 
approach is to use the following properties of conditional expectation. 


Theorem 9.5 (Properties of Conditional Expectation)

(a) Linearity:

$$E[a_1 X_1 + a_2 X_2 | Y] = a_1 E[X_1|Y] + a_2 E[X_2|Y];$$

(b) Factoring Known Values:

$$E[h(Y)X | Y] = h(Y)E[X|Y];$$

(c) Independence: If $X$ and $Y$ are independent, then

$$E[X|Y] = E(X);$$

(d) Smoothing:

$$E(E[X|Y]) = E(X);$$

(e) Tower:

$$E[E[X|Y, Z] | Y] = E[X|Y].$$


Proof

(a) By Lemma 9.4(b), it suffices to show that

$$a_1 X_1 + a_2 X_2 - (a_1 E[X_1|Y] + a_2 E[X_2|Y])$$

is orthogonal to $\mathcal{G}(Y)$. But this is immediate since it is the sum of two terms

$$a_i(X_i - E[X_i|Y])$$

for $i = 1, 2$ that are orthogonal to $\mathcal{G}(Y)$.

(b) By Lemma 9.4(b), it suffices to show that

$$h(Y)X - h(Y)E[X|Y]$$

is orthogonal to $\mathcal{G}(Y)$, i.e., that

$$E((h(Y)X - h(Y)E[X|Y])\phi(Y)) = 0, \forall \phi(\cdot).$$

Now,

$$E((h(Y)X - h(Y)E[X|Y])\phi(Y)) = E((X - E[X|Y])h(Y)\phi(Y)) = 0,$$

because $X - E[X|Y]$ is orthogonal to $\mathcal{G}(Y)$ and therefore to $h(Y)\phi(Y)$.

(c) By Lemma 9.4(b), it suffices to show that

$$X - E(X)$$

is orthogonal to $\mathcal{G}(Y)$. Now,

$$E((X - E(X))\phi(Y)) = E(X - E(X))E(\phi(Y)) = 0.$$

The first equality follows from the fact that $X - E(X)$ and $\phi(Y)$ are independent since they are functions of independent random variables.⁴

(d) Letting $\phi(Y) = 1$ in (9.8), we find

$$E(X - E[X|Y]) = 0,$$

which is the identity we wanted to prove.

(e) The projection property states that $E[W|Y] = V$ if $V$ is a function of $Y$ and if $W - V$ is orthogonal to $\mathcal{G}(Y)$. Applying this characterization to $W = E[X|Y, Z]$ and $V = E[X|Y]$, we find that to show that $E[E[X|Y, Z]|Y] = E[X|Y]$, it suffices to show that $E[X|Y, Z] - E[X|Y]$ is orthogonal to $\mathcal{G}(Y)$. That is, we should show that

$$E(h(Y)(E[X|Y, Z] - E[X|Y])) = 0$$

for any function $h(Y)$. But $E(h(Y)(X - E[X|Y, Z])) = 0$ by the projection property, because $h(Y)$ is some function of $(Y, Z)$. Also, $E(h(Y)(X - E[X|Y])) = 0$, also by the projection property. Hence,

$$E(h(Y)(E[X|Y, Z] - E[X|Y])) = E(h(Y)(X - E[X|Y])) - E(h(Y)(X - E[X|Y, Z])) = 0.$$

□


As an example, assume that $X, Y, Z$ are i.i.d. $U[0, 1]$. We want to calculate

$$E[(X + 2Y)^2 | Y].$$

We find

$$\begin{aligned} E[(X + 2Y)^2 | Y] &= E[X^2 + 4Y^2 + 4XY | Y] \\ &= E[X^2|Y] + 4E[Y^2|Y] + 4E[XY|Y], \text{ by linearity} \\ &= E(X^2) + 4E[Y^2|Y] + 4E[XY|Y], \text{ by independence} \\ &= E(X^2) + 4Y^2 + 4Y E[X|Y], \text{ by factoring known values} \\ &= E(X^2) + 4Y^2 + 4Y E(X), \text{ by independence} \\ &= \frac{1}{3} + 4Y^2 + 2Y, \text{ since } X =_D U[0, 1]. \end{aligned}$$

⁴ See Appendix B.

Note that calculating the conditional density of (X + 2Y)? given Y would have 
been quite a bit more tedious. 
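A quick simulation can confirm the result of this calculation, namely $E[(X + 2Y)^2 | Y = y] = 1/3 + 4y^2 + 2y$. Because $X$ is independent of $Y$, conditioning on $Y = y$ leaves $X$ uniform on [0, 1], so it suffices to average over samples of $X$ for a fixed $y$. The function names, sample size, and seed are assumptions of this sketch.

```python
import random

def cond_exp_mc(y, n=200_000, seed=0):
    """Monte Carlo estimate of E[(X + 2Y)^2 | Y = y] with X uniform on [0, 1]."""
    rng = random.Random(seed)
    return sum((rng.random() + 2 * y) ** 2 for _ in range(n)) / n

def cond_exp_formula(y):
    """Closed form obtained from the properties of conditional expectation."""
    return 1 / 3 + 4 * y ** 2 + 2 * y

for y in (0.0, 0.5, 1.0):
    print(y, cond_exp_mc(y), cond_exp_formula(y))
```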

In some situations, one may be able to exploit symmetry to evaluate the conditional expectation. Here is one representative example. Assume that $X, Y, Z$ are i.i.d. Then, we claim that

$$E[X | X + Y + Z] = \frac{1}{3}(X + Y + Z). \quad (9.10)$$

To see this, note that, by symmetry,

$$E[X | X + Y + Z] = E[Y | X + Y + Z] = E[Z | X + Y + Z].$$

Denote by $V$ the common value of these random variables. Note that their sum is

$$3V = E[X + Y + Z | X + Y + Z],$$

by linearity. Thus, $3V = X + Y + Z$, which proves our claim.
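The symmetry argument can be checked exactly in a discrete analogue, where the three random variables are i.i.d. and uniform on a small finite set. The choice of the set {0, 1, 2} and the function name are assumptions for this sketch; the text's argument itself does not depend on the distribution.

```python
from fractions import Fraction
from itertools import product

def cond_exp_given_sum(values, s):
    """E[X | X + Y + Z = s] for X, Y, Z i.i.d. and uniform on `values`."""
    # All triples are equally likely, so conditioning amounts to averaging
    # the first coordinate over the triples whose sum is s.
    triples = [t for t in product(values, repeat=3) if sum(t) == s]
    return Fraction(sum(t[0] for t in triples), len(triples))

for s in range(7):
    print(s, cond_exp_given_sum([0, 1, 2], s))  # equals s/3 in each case
```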


9.6.1 MMSE for Jointly Gaussian 


In general $L[X|Y] \neq E[X|Y]$. As a trivial example, let $Y =_D U[-1, 1]$ and $X = Y^2$. Then $E[X|Y] = Y^2$ and $L[X|Y] = E(X) = 1/3$ since $\text{cov}(X, Y) = E(XY) - E(X)E(Y) = 0$.

Figure 9.12 recalls that $E[X|Y]$ is the projection of $X$ onto $\mathcal{G}(Y)$, whereas $L[X|Y]$ is the projection of $X$ onto $\mathcal{L}(Y)$. Since $\mathcal{L}(Y)$ is a subspace of $\mathcal{G}(Y)$, one expects the two projections to be different, in general.

However, there are examples where E[X|Y] happens to be linear. We saw one 
such example in (9.10) and it is not difficult to construct many other examples. 
Fig. 9.12 The MMSE and LLSE are generally different. Here $\mathcal{L}(Y) = \{a + bY \mid a, b \in \Re\}$ and $\mathcal{G}(Y) = \{g(Y) \mid g(\cdot) \text{ is a function}\}$


There is an important class of problems where this occurs. It is when X and Y 
are jointly Gaussian. We state that result as a theorem. 


Theorem 9.6 (MMSE for Jointly Gaussian RVs)
Let $X, Y$ be jointly Gaussian random variables. Then

$$E[X|Y] = L[X|Y] = E(X) + \frac{\text{cov}(X, Y)}{\text{var}(Y)}(Y - E(Y)).$$


Proof Note that

$$X - L[X|Y] \text{ and } Y \text{ are uncorrelated.}$$

Also, $X - L[X|Y]$ and $Y$ are two linear functions of the jointly Gaussian random variables $X$ and $Y$. Consequently, they are jointly Gaussian by Theorem 8.4 and they are independent by Theorem 8.3.

Consequently,

$$X - L[X|Y] \text{ and } \phi(Y) \text{ are independent,}$$

for any $\phi(\cdot)$, because functions of independent random variables are independent by Theorem B.11 in Appendix B. Hence,

$$X - L[X|Y] \text{ and } \phi(Y) \text{ are uncorrelated,}$$

for any $\phi(\cdot)$ by Theorem B.4 of Appendix B.
This shows that

$$X - L[X|Y] \text{ is orthogonal to } \mathcal{G}(Y),$$

and, consequently, that $L[X|Y] = E[X|Y]$. □
9.7 Vector Case

So far, to keep notation at a minimum, we have considered $L[X|Y]$ and $E[X|Y]$ when $X$ and $Y$ are single random variables. In this section, we discuss the vector case, i.e., $L[X|Y]$ and $E[X|Y]$ when $X$ and $Y$ are random vectors. The only difficulty is one of notation. Conceptually, there is nothing new.


Definition 9.3 (LLSE of Random Vectors) Let $X$ and $Y$ be random vectors of dimensions $m$ and $n$, respectively. Then

$$L[X|Y] = AY + b$$

where $A$ is the $m \times n$ matrix and $b$ the vector in $\Re^m$ that minimize

$$E(\|X - AY - b\|^2).$$

Thus, as in the scalar case, the LLSE is the linear function of the observations that best approximates $X$, in the mean squared error sense.


Before proceeding, review the notation of Sect. B.6 for $\Sigma_Y$ and $\text{cov}(X, Y)$.


Theorem 9.7 (LLSE of Vectors) Let $X$ and $Y$ be random vectors such that $\Sigma_Y$ is nonsingular.


(a) Then

$$L[X|Y] = E(X) + \text{cov}(X, Y)\Sigma_Y^{-1}(Y - E(Y)). \quad (9.11)$$

(b) Moreover,

$$E(\|X - L[X|Y]\|^2) = \text{tr}(\Sigma_X - \text{cov}(X, Y)\Sigma_Y^{-1}\text{cov}(Y, X)). \quad (9.12)$$

In this expression, for a square matrix $M$, $\text{tr}(M) := \sum_i M_{i,i}$ is the trace of the matrix.
Proof

(a) The proof is similar to the scalar case. Let $Z$ be the right-hand side of (9.11).
One shows that the error $X - Z$ is orthogonal to all the linear functions of $Y$.
One then uses that fact to show that $X$ is closer to $Z$ than to any other linear
function $h(Y)$ of $Y$.

First we show the orthogonality. Since $E(X - Z) = 0$, we have

$$E((X - Z)(BY + b)') = E((X - Z)(BY)') = E((X - Z)Y')B'.$$

Next, we show that $E((X - Z)Y') = 0$. To see this, note that

$$E((X - Z)Y') = E((X - Z)(Y - E(Y))')$$
$$= E((X - E(X))(Y - E(Y))') - \mathrm{cov}(X, Y)\Sigma_Y^{-1}E((Y - E(Y))(Y - E(Y))')$$
$$= \mathrm{cov}(X, Y) - \mathrm{cov}(X, Y)\Sigma_Y^{-1}\Sigma_Y = 0.$$

Second, we show that $Z$ is closer to $X$ than any linear $h(Y)$. We have

$$E(\|X - h(Y)\|^2) = E((X - h(Y))'(X - h(Y)))$$
$$= E((X - Z + Z - h(Y))'(X - Z + Z - h(Y)))$$
$$= E(\|X - Z\|^2) + E(\|Z - h(Y)\|^2) + 2E((X - Z)'(Z - h(Y))).$$

We claim that the last term is equal to zero. To see this, note that, with $m$ the
dimension of $X$,

$$E((X - Z)'(Z - h(Y))) = \sum_{i=1}^{m} E((X_i - Z_i)(Z_i - h_i(Y))).$$

Also,

$$E((X_i - Z_i)(Z_i - h_i(Y))) = [E((X - Z)(Z - h(Y))')]_{i,i}$$

and the matrix $E((X - Z)(Z - h(Y))')$ is equal to zero since $X - Z$ is orthogonal
to any linear function of $Y$ and, in particular, to $Z - h(Y)$.

(Note: an alternative way of showing that the last term is equal to zero is to
write

$$E((X - Z)'(Z - h(Y))) = \mathrm{tr}\,E((X - Z)(Z - h(Y))') = 0,$$

where the first equality comes from the fact that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for matrices
of compatible dimensions.)

(b) Let $\tilde{X} := X - L[X|Y]$ be the estimation error. Thus,

$$\tilde{X} = X - E(X) - \mathrm{cov}(X, Y)\Sigma_Y^{-1}(Y - E(Y)).$$

Now, if $V$ and $W$ are two zero-mean random vectors and $M$ a matrix,

$$\mathrm{cov}(V - MW) = E((V - MW)(V - MW)')$$
$$= E(VV' - VW'M' - MWV' + MWW'M')$$
$$= \mathrm{cov}(V) - \mathrm{cov}(V, W)M' - M\mathrm{cov}(W, V) + M\mathrm{cov}(W)M'.$$

We apply this identity with $V = X - E(X)$, $W = Y - E(Y)$, and $M = \mathrm{cov}(X, Y)\Sigma_Y^{-1}$.
The two cross terms are then both equal to $\mathrm{cov}(X, Y)\Sigma_Y^{-1}\mathrm{cov}(Y, X)$. Hence,

$$\mathrm{cov}(\tilde{X}) = \Sigma_X - 2\mathrm{cov}(X, Y)\Sigma_Y^{-1}\mathrm{cov}(Y, X) + \mathrm{cov}(X, Y)\Sigma_Y^{-1}\Sigma_Y\Sigma_Y^{-1}\mathrm{cov}(Y, X)$$
$$= \Sigma_X - \mathrm{cov}(X, Y)\Sigma_Y^{-1}\mathrm{cov}(Y, X).$$

To conclude the proof, note that, for a zero-mean random vector $V$,

$$E(\|V\|^2) = E(\mathrm{tr}(VV')) = \mathrm{tr}(E(VV')) = \mathrm{tr}(\Sigma_V).$$
□
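As a numerical illustration of Theorem 9.7, the following minimal sketch evaluates (9.11) and (9.12) for a scalar $X$ (so $m = 1$) and a two-dimensional $Y$. The covariance values and the helper names `inv2` and `llse_gain_and_mse` are assumptions made up for this example; they do not come from the text.

```python
from fractions import Fraction as F

def inv2(M):
    """Inverse of a nonsingular 2x2 matrix stored as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("Sigma_Y must be nonsingular")
    return [[d / det, -b / det], [-c / det, a / det]]

def llse_gain_and_mse(sigma_x, cov_xy, sigma_y):
    """Return (gain, mse) for scalar X and two-dimensional Y.

    gain = cov(X, Y) Sigma_Y^{-1}, the row vector multiplying Y - E(Y) in (9.11)
    mse  = Sigma_X - gain cov(Y, X), which is (9.12) when m = 1
    """
    inv = inv2(sigma_y)
    gain = [sum(cov_xy[k] * inv[k][j] for k in range(2)) for j in range(2)]
    mse = sigma_x - sum(gain[j] * cov_xy[j] for j in range(2))
    return gain, mse

# Assumed second-order statistics: var(X) = 1, cov(X, Y) = [1, 1],
# Sigma_Y = [[2, 1], [1, 2]]. Fractions keep the arithmetic exact.
gain, mse = llse_gain_and_mse(F(1), [F(1), F(1)], [[F(2), F(1)], [F(1), F(2)]])
print(gain, mse)
```

With these numbers, the estimate is $L[X|Y] = E(X) + (Y_1 - E(Y_1))/3 + (Y_2 - E(Y_2))/3$ and the mean squared error is $1/3$, smaller than $\mathrm{var}(X) = 1$.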


9.8 Kalman Filter 


The Kalman Filter is an algorithm to update the estimate of the state of a system
using its output, as sketched in Fig. 9.13. The system has a state $X(n)$ and an output
$Y(n)$ at time $n = 0, 1, \ldots$. These variables are defined through a system of linear
equations:

$$X(n+1) = AX(n) + V(n), \quad n \ge 0; \qquad (9.13)$$
$$Y(n) = CX(n) + W(n), \quad n \ge 0. \qquad (9.14)$$

In these equations, the random variables $\{X(0), V(n), W(n), n \ge 0\}$ are all
orthogonal and zero-mean. The covariance of $V(n)$ is $\Sigma_V$ and that of $W(n)$ is $\Sigma_W$.
The filter is developed when the variables are random vectors and $A$, $C$ are matrices
of compatible dimensions.

The objective is to derive recursive equations to calculate

$$\hat{X}(n) = L[X(n)|Y(0), \ldots, Y(n)], \quad n \ge 0.$$


Fig. 9.13 The Kalman Filter computes the LLSE of the state of a system given the
past of its output

9.8.1 The Filter


Here is the result, due to Rudolf Kalman (Fig. 9.14), which we prove in the next 
chapter. Do not panic when you see the equations! 


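Theorem 9.8 (Kalman Filter) One has

$$\hat{X}(n) = A\hat{X}(n-1) + K_n[Y(n) - CA\hat{X}(n-1)] \qquad (9.15)$$
$$K_n = S_nC'[CS_nC' + \Sigma_W]^{-1} \qquad (9.16)$$
$$S_n = A\Sigma_{n-1}A' + \Sigma_V \qquad (9.17)$$
$$\Sigma_n = (I - K_nC)S_n. \qquad (9.18)$$

Moreover,

$$S_n = \mathrm{cov}(X(n) - A\hat{X}(n-1)) \text{ and } \Sigma_n = \mathrm{cov}(X(n) - \hat{X}(n)). \qquad (9.19)$$
□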


We will give a number of examples of this result. But first, let us make a few 
comments. 


* Equations (9.15)-(9.18) are recursive: the estimate at time $n$ is a simple linear
function of the estimate at time $n - 1$ and of the new observation $Y(n)$.

* The matrix $K_n$ is the filter gain. It can be precomputed at time 0.

* The covariance of the error $X(n) - \hat{X}(n)$, $\Sigma_n$, can also be precomputed at time
0: it does not depend on the observations $(Y(0), \ldots, Y(n))$. The estimate $\hat{X}(n)$
depends on these observations but the mean squared error does not.

* If $X(0)$ and the noise random variables are Gaussian, then the Kalman filter
computes the MMSE.

* Finally, observe that these equations, even though they look a bit complicated,
can be programmed in a few lines. This filter is elementary to implement and this
explains its popularity.
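To make the last comment concrete, here is a minimal sketch of one update of (9.15)-(9.18) in plain Python, restricted to a scalar observation so that the inverse in (9.16) is a division. The function names, the list-of-rows matrix representation, and the two-state system used in the usage lines are assumptions chosen for illustration.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def mat_add(A, B, sign=1.0):
    """Entrywise A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def kalman_step(x_hat, sigma, y, A, C, sigma_v, var_w):
    """One update of (9.15)-(9.18) for a scalar observation y.

    x_hat: d x 1 previous estimate, sigma: d x d previous error covariance,
    A: d x d, C: 1 x d, sigma_v: d x d, var_w: variance of W(n).
    """
    S = mat_add(mat_mul(mat_mul(A, sigma), transpose(A)), sigma_v)  # (9.17)
    SCt = mat_mul(S, transpose(C))                                  # S C', d x 1
    denom = mat_mul(C, SCt)[0][0] + var_w                           # C S C' + var_w
    K = [[row[0] / denom] for row in SCt]                           # (9.16)
    pred = mat_mul(A, x_hat)                                        # A x_hat(n-1)
    innovation = y - mat_mul(C, pred)[0][0]
    x_new = [[p[0] + k[0] * innovation] for p, k in zip(pred, K)]   # (9.15)
    sigma_new = mat_add(S, mat_mul(K, mat_mul(C, S)), sign=-1.0)    # (9.18)
    return x_new, sigma_new, K

# Hypothetical two-state system observed through its first component.
A = [[1.0, 1.0], [0.0, 1.0]]
C = [[1.0, 0.0]]
sigma_v = [[1.0, 0.0], [0.0, 0.0]]
x_hat, sigma = [[0.0], [0.0]], [[1.0, 0.0], [0.0, 1.0]]
x_hat, sigma, K = kalman_step(x_hat, sigma, 2.0, A, C, sigma_v, 0.25)
print(x_hat, sigma, K)
```

Calling `kalman_step` once per new observation, and feeding back the returned estimate and covariance, runs the filter. For vector observations one would replace the division by a matrix inverse.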


Fig. 9.14 Rudolf Kalman, 1930-2016

9.8.2 Examples
In this section, we examine a few examples of the Kalman filter. 
Random Walk 


The first example is a filter to track a "random walk" by making noisy observations.
Let

$$X(n+1) = X(n) + V(n) \qquad (9.20)$$
$$Y(n) = X(n) + W(n) \qquad (9.21)$$
$$\mathrm{var}(V(n)) = 0.04, \quad \mathrm{var}(W(n)) = 0.09. \qquad (9.22)$$

That is, $X(n)$ has orthogonal increments and it is observed with orthogonal noise.
Figure 9.15 shows a simulation of the filter. The left-hand part of the figure shows
that the estimate tracks the state with a bounded error. The middle part of the figure
shows the variance of the error, which can be precomputed. The right-hand part of
the figure shows the filter with the time-varying gain (in blue) and the filter with the
limiting gain (in green). The filter with the constant gain performs as well as the one
with the time-varying gain, in the limit, as justified by part (b) of Theorem 10.3.
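The precomputation of the error variance and of the gain for this example can be sketched as follows. With $A = C = 1$, equations (9.16)-(9.18) become scalar recursions. The starting value $\Sigma_0 = 0$ (a known initial state) and the function name `gain_sequence` are assumptions of this sketch.

```python
import math

def gain_sequence(var_v, var_w, sigma0, steps):
    """Iterate the scalar versions of (9.16)-(9.18) with A = C = 1."""
    sigma = sigma0
    gains = []
    for _ in range(steps):
        s = sigma + var_v          # (9.17): S_n = Sigma_{n-1} + var(V)
        k = s / (s + var_w)        # (9.16): K_n = S_n / (S_n + var(W))
        sigma = (1 - k) * s        # (9.18): Sigma_n = (1 - K_n) S_n
        gains.append(k)
    return gains, sigma

gains, sigma_limit = gain_sequence(0.04, 0.09, 0.0, 200)

# The limit solves Sigma^2 + var_v * Sigma - var_v * var_w = 0.
closed_form = (-0.04 + math.sqrt(0.04 ** 2 + 4 * 0.04 * 0.09)) / 2
print(gains[0], gains[-1], sigma_limit, closed_form)
```

The gain starts near 0.31 and settles near 0.48, while the error variance settles near 0.043. None of these numbers depend on the observations.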


Random Walk with Unknown Drift 
In the second example, one tracks a random walk that has an unknown drift. This
system is modeled by the following equations:

$$X_1(n+1) = X_1(n) + X_2(n) + V(n) \qquad (9.23)$$
$$X_2(n+1) = X_2(n) \qquad (9.24)$$
$$Y(n) = X_1(n) + W(n) \qquad (9.25)$$
$$\mathrm{var}(V(n)) = 1, \quad \mathrm{var}(W(n)) = 0.25. \qquad (9.26)$$

In this model, $X_2(n)$ is the constant but unknown drift and $X_1(n)$ is the value of
the "random walk." Figure 9.16 shows a simulation of the filter. It shows that the
filter eventually estimates the drift and that the estimate of the position of the walk
is quite accurate.

Fig. 9.15 The Kalman Filter for (9.20)-(9.22)

Fig. 9.16 The Kalman Filter for (9.23)-(9.26)

Random Walk with Changing Drift 
In the third example, one tracks a random walk that has changing drift. This system 
is modeled by the following equations: 


$$X_1(n+1) = X_1(n) + X_2(n) + V_1(n) \qquad (9.27)$$
$$X_2(n+1) = X_2(n) + V_2(n) \qquad (9.28)$$
$$Y(n) = X_1(n) + W(n) \qquad (9.29)$$
$$\mathrm{var}(V_1(n)) = 1, \quad \mathrm{var}(V_2(n)) = 0.01, \qquad (9.30)$$
$$\mathrm{var}(W(n)) = 0.25. \qquad (9.31)$$

In this model, $X_2(n)$ is the varying drift and $X_1(n)$ is the value of the "random
walk." Figure 9.17 shows a simulation of the filter. It shows that the filter tries to
track the drift and that the estimate of the position of the walk is quite accurate.


Falling Object 
In the fourth example, one tracks a falling object. The elevation Z (n) of that falling 
object follows the equation 


$$Z(n) = Z(0) + S(0)n - gn^2/2 + V(n), \quad n \ge 0,$$

where $S(0)$ is the initial vertical velocity of the object and $g$ is the gravitational
constant at the surface of the earth. In this expression, $V(n)$ is some noise that
perturbs the motion. We observe $\eta(n) = Z(n) + W(n)$, where $W(n)$ is some noise.

Fig. 9.17 The Kalman Filter for (9.27)-(9.31)

Fig. 9.18 The Kalman Filter for (9.32)-(9.35)

Since the term $-gn^2/2$ is known, we consider

$$X_1(n) = Z(n) + gn^2/2 \text{ and } Y(n) = \eta(n) + gn^2/2.$$

With this change of variables, the system is described by the following equations:

$$X_1(n+1) = X_1(n) + X_2(n) + V(n) \qquad (9.32)$$
$$X_2(n+1) = X_2(n) \qquad (9.33)$$
$$Y(n) = X_1(n) + W(n) \qquad (9.34)$$
$$\mathrm{var}(V(n)) = 100 \text{ and } \mathrm{var}(W(n)) = 1600. \qquad (9.35)$$

Figure 9.18 shows a simulation of the filter that computes $\hat{X}_1(n)$, from which we
subtract $gn^2/2$ to get an estimate of the actual altitude $Z(n)$ of the object.


9.9 Summary 
* LLSE, linear regression, and MMSE; 
* Projection characterization; 
* MMSE of jointly Gaussian is linear; 
* Kalman Filter. 
9.9.1 Key Equations and Formulas 
LLSE                      $L[X|Y] = E(X) + \mathrm{cov}(X, Y)\mathrm{var}(Y)^{-1}(Y - E(Y))$      Theorem 9.1
Orthogonality             $X - L[X|Y] \perp a + bY$                                   (9.6)
Linear Regression         converges to $L[X|Y]$                                       Theorem 9.2
Conditional Expectation   $E[X|Y] = \cdots$                                           Definition 9.2
Orthogonality             $X - E[X|Y] \perp g(Y)$                                     Lemma 9.4
MMSE = CE                 $\mathrm{MMSE}[X|Y] = E[X|Y]$                               Theorem 9.3
Properties of CE          Linearity, smoothing, etc.                                  Theorem 9.5
CE for J.G.               If $X, Y$ J.G., then $E[X|Y] = L[X|Y] = \cdots$             Theorem 9.6
LLSE vectors              $L[X|Y] = E(X) + \Sigma_{X,Y}\Sigma_Y^{-1}(Y - E(Y))$       Theorem 9.7
Kalman Filter             $\hat{X}(n) = A\hat{X}(n-1) + K_n[Y(n) - CA\hat{X}(n-1)]$   Theorem 9.8


9.10 References 


LLSE, MMSE, and linear regression are covered in Chapter 4 of Bertsekas and
Tsitsiklis (2008). The Kalman filter was introduced in Kalman (1960). The text
(Brown and Hwang 1996) is an easy introduction to Kalman filters with many
examples.


9.11 Problems 


Problem 9.1 Assume that $X_n = Y_n + 2Y_n^2 + Z_n$ where the $Y_n$ and $Z_n$ are i.i.d.
$U[0, 1]$. Let also $X = X_1$ and $Y = Y_1$.

(a) Calculate $L[X|Y]$ and $E((X - L[X|Y])^2)$;

(b) Calculate $Q[X|Y]$ and $E((X - Q[X|Y])^2)$ where $Q[X|Y]$ is the quadratic least
squares estimate of $X$ given $Y$.

(c) Design a stochastic gradient algorithm to compute $Q[X|Y]$ and implement it in
Python.


Problem 9.2 We want to compare the off-line and on-line methods for computing
$L[X|Y]$. Use the setup of the previous problem.

(a) Generate $N = 1{,}000$ samples and compute the linear regression of $X$ given $Y$.
Say that this is $X = aY + b$.

(b) Using the same samples, compute the linear fit recursively using the stochastic
gradient algorithm. Say that you obtain $X = cY + d$.

(c) Evaluate the quality of the two estimates you obtained by computing
$E((X - aY - b)^2)$ and $E((X - cY - d)^2)$.


Problem 9.3 The random variables X, Y, Z are jointly Gaussian,
(a) Find E[X|Y, Z];
(b) Find the variance of error.


Problem 9.4 You observe three i.i.d. samples $X_1, X_2, X_3$ from the distribution
$f_{X|\theta}(x) = \frac{1}{2}e^{-|x - \theta|}$ where $\theta \in \Re$ is the parameter to estimate. Find
$\mathrm{MLE}[\theta|X_1, X_2, X_3]$.

Problem 9.5 


(a) Given three independent N(0,1) random variables X, Y, and Z, find the 
following minimum mean square estimator: 


E[X + 3Y|2Y + 5Z]. 
(b) For the above, compute the mean squared error of the estimator. 


Problem 9.6 Given two independent N (0, 1) random variables X and Y, find the 
following linear least square estimator: 


E[X[X? +Y]. 
Hint: The characteristic function of a $N(0, 1)$ random variable $X$ is as follows:

$$E(e^{isX}) = e^{-s^2/2}.$$

Problem 9.7 Consider a sensor network with $n$ sensors that are making observations
$Y^n = (Y_1, \ldots, Y_n)$ of a signal $X$ where

$$Y_i = aX + Z_i, \quad i = 1, \ldots, n.$$

In this expression, $X =_D N(0, 1)$, $Z_i =_D N(0, \sigma^2)$, for $i = 1, \ldots, n$ and these
random variables are mutually independent.

(a) Compute the MMSE estimator of $X$ given $Y^n$.
(b) Compute the mean squared error $\sigma_n^2$ of the estimator.
(c) Assume each measurement has a cost $C$ and that we want to minimize

$$nC + \sigma_n^2.$$

Find the best value of $n$.
(d) Assume that we can decide at each step whether to make another measurement
or to stop. Our goal is to minimize the expected value of

$$\nu C + \sigma_\nu^2,$$

where $\nu$ is the random number of measurements. Do you think there is a decision
rule that will do better than the deterministic value $n$ derived in (c)? Explain.


Problem 9.8 We want to use a Kalman filter to detect a change in the popularity of
a word in Twitter messages. To do this, we create a model of the number $Y(n)$ of times
that particular word appears in Twitter messages on day $n$. The model is as follows:

$$X(n+1) = X(n)$$
$$Y(n) = X(n) + W(n),$$

where the $W(n)$ are zero-mean and uncorrelated. This model means that we are
observing numbers of occurrences with an unknown mean $X(n)$ that is supposed
to be constant. The idea is that if the mean actually changes, we should be able to
detect it by noticing that the errors between $\hat{Y}(n)$ and $Y(n)$ are large. Propose an
algorithm for detecting that change and implement it in Python.


Problem 9.9 The random variable X is exponentially distributed with mean 1. 
Given X, the random variable Y is exponentially distributed with rate X. 


(a) Calculate E[Y | X]. 
(b) Calculate E[X|Y].

Problem 9.10 The random variables $X, Y, Z$ are i.i.d. $N(0, 1)$.
(a) Find L[X? + Y?|X + Y]; 

(b) Find $E[X + 2Y|X + 3Y + 4Z]$;

(c) Find E[(X + Y|X — Y]. 


Problem 9.11 Let $(V_n, n \ge 0)$ be i.i.d. $N(0, \sigma^2)$ and independent of $X_0 =_D
N(0, u^2)$. Define

$$X_{n+1} = aX_n + V_n, \quad n \ge 0.$$

1. What is the distribution of $X_n$ for $n \ge 1$?
2. Find $E[X_{n+m}|X_n]$ for $0 \le n < n + m$.
3. Find $u$ so that the distribution of $X_n$ is the same for all $n \ge 0$.

Problem 9.12 Let $\theta =_D U[0, 1]$, and given $\theta$, the random variable $X$ is uniformly
distributed in $[0, \theta]$. Find $E[\theta|X]$.


Problem 9.13 Let $(X, Y)' \sim N([0; 0], [3, 1; 1, 1])$. Find $E[X^2|Y]$.


Problem 9.14 Let $(X, Y, Z)' \sim N([0; 0; 0], [5, 3, 1; 3, 9, 3; 1, 3, 1])$. Find
$E[X|Y, Z]$.


Problem 9.15 Consider arbitrary random variables X and Y. Prove the following 
property: 


var(Y) = E(var[Y|X]) + var(E[Y | X]). 


Problem 9.16 Let the joint p.d.f. of two random variables X and Y be 


1 
= Gr eee HHO S y ur 


First show that this is a valid joint p.d.f. Suppose you observe Y drawn from this 
joint density. Find MMSE[X|Y]. 


Problem 9.17 Given four independent N (0, 1) random variables X, Y, Z, and V, 
find the following minimum mean square estimate: 


$$E[X + 2Y + 3Z|Y + 5Z + 4V].$$
Find the mean squared error of the estimate. 


Problem 9.18 Assume that $X, Y$ are two random variables that are such that
$E[X|Y] = L[X|Y]$. Then, it must be that (choose the correct answers, if any)

* $X$ and $Y$ are jointly Gaussian;

* $X$ can be written as $X = aY + Z$ where $Z$ is a random variable that is
independent of $Y$;

* $E((X - L[X|Y])Y^k) = 0$ for all $k \ge 0$;

* $E((X - L[X|Y])\sin(3Y + 5)) = 0$.


Problem 9.19 In a linear system with independent Gaussian noise, with state $X_n$
and observation $Y_n$, the Kalman filter computes (choose the correct answers, if any)

* $\mathrm{MLE}[Y_n|X^n]$;
* $\mathrm{MLE}[X_n|Y^n]$;
* $\mathrm{MAP}[Y_n|X^n]$;
* $\mathrm{MAP}[X_n|Y^n]$;
* $E[X_n|Y^n]$;


^ 


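Problem 9.20 Let $(X, Y)$ where $Y' = [Y_1, Y_2, Y_3, Y_4]$ be $N(\mu, \Sigma)$ with $\mu' =
[2, 1, 3, 4, 5]$ and

$$\Sigma = \begin{bmatrix} 3 & 4 & 6 & 12 & 8 \\ 4 & 6 & 9 & 18 & 12 \\ 6 & 9 & 14 & 28 & 18 \\ 12 & 18 & 28 & 56 & 36 \\ 8 & 12 & 18 & 36 & 24 \end{bmatrix}.$$

Find $E[X|Y]$.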


Problem 9.21 Let $X = AV$ and $Y = CV$ where $V =_D N(0, I)$.
Find $E[X|Y]$.


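Problem 9.22 Given $\theta \in \{0, 1\}$, $X = N(0, \Sigma_\theta)$ where

$$\Sigma_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \text{ and } \Sigma_1 = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},$$

where $\rho > 0$ is given.
Find $\mathrm{MLE}[\theta|X]$.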


Problem 9.23 Given two independent N (0, 1) random variables X and Y, find the 
following linear least square estimator: 


LXIX? +Y]. 
Hint: The characteristic function of a $N(0, 1)$ random variable $X$ is as follows:
$$E(e^{isX}) = e^{-s^2/2}.$$
Problem 9.24 Let $X, Y, Z$ be i.i.d. $N(0, 1)$. Find
E[X|X - Y, X - Z. Y — Z]. 
Hint: Argue that the observation Y — Z is redundant. 
Problem 9.25 Let $X, Y_1, Y_2, Y_3$ be zero-mean with covariance matrix


10 6 
x- 6 9 
5 6 


16 21 18 


Find $L[X|Y_1, Y_2, Y_3]$. Hint: You will observe that $\Sigma_Y$ is singular. This means that at
least one of the observations $Y_1$, $Y_2$, or $Y_3$ is redundant, i.e., is a linear combination
of the others. This implies that $L[X|Y_1, Y_2, Y_3] = L[X|Y_1, Y_2]$.

Topics: Derivation and properties of Kalman filter; Extended Kalman filter


10.1 Updating LLSE 


In many situations, one keeps making observations and one wishes to update
the estimate accordingly, hopefully without having to recompute everything from
scratch. That is, one hopes for a method that enables one to calculate $L[X|Y, Z]$ from
$L[X|Y]$ and $Z$.

The key idea is in the following result. 


Theorem 10.1 (LLSE Update—Orthogonal Additional Observation) Assume 
that X, Y, and Z are zero-mean and that Y and Z are orthogonal. Then 


$$L[X|Y, Z] = L[X|Y] + L[X|Z]. \qquad (10.1)$$
□
Proof Figure 10.1 shows why the result holds. To be convinced mathematically, we
need to show that the error

$$X - (L[X|Y] + L[X|Z])$$

is orthogonal to $Y$ and to $Z$. To see why it is orthogonal to $Y$, note that the error is

$$(X - L[X|Y]) - L[X|Z].$$

Now, the term between parentheses is orthogonal to $Y$, by the projection property
of $L[X|Y]$. Also, the second term is linear in $Z$, and is therefore orthogonal to $Y$
since $Z$ is orthogonal to $Y$. One shows that the error is orthogonal to $Z$ in the same
way. □

Fig. 10.1 The LLSE is easy to update after an additional orthogonal observation

© The Author(s) 2021
J. Walrand, Probability in Electrical Engineering and Computer Science,
https://doi.org/10.1007/978-3-030-49995-2 10


A simple consequence of this result is the following fact. 


Theorem 10.2 (LLSE Update—General Additional Observation) Assume that 
X, Y, and Z are zero-mean. Then 


$$L[X|Y, Z] = L[X|Y] + L[X|Z - L[Z|Y]]. \qquad (10.2)$$
□
Proof The idea here is that one considers the innovation $\tilde{Z} := Z - L[Z|Y]$, which
is the information in the new observation $Z$ that is orthogonal to $Y$.
To see why the result holds, note that any linear combination of $Y$ and $Z$ can be
written as a linear combination of $Y$ and $\tilde{Z}$. For instance, if $L[Z|Y] = CY$, then

$$AY + BZ = AY + B(Z - CY) + BCY = (A + BC)Y + B\tilde{Z}.$$

Thus, the set of linear functions of $Y$ and $Z$ is the same as the set of linear functions
of $Y$ and $\tilde{Z}$, so that

$$L[X|Y, Z] = L[X|Y, \tilde{Z}].$$

Thus, (10.2) follows from Theorem 10.1 since $Y$ and $\tilde{Z}$ are orthogonal. □
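The update formula (10.2) can be checked numerically in the scalar case. The sketch below computes the coefficients of $L[X|Y, Z] = aY + bZ$ in two ways: directly from the joint covariance of $(Y, Z)$, and through the innovation $Z - L[Z|Y]$. The function names and the covariance values are assumptions chosen for this illustration; all variables are zero-mean scalars.

```python
from fractions import Fraction as F

def llse_direct(c_xy, c_xz, c_yy, c_yz, c_zz):
    """Coefficients (a, b) of L[X|Y,Z] = aY + bZ from [c_xy, c_xz] Sigma^{-1}."""
    det = c_yy * c_zz - c_yz * c_yz
    a = (c_xy * c_zz - c_xz * c_yz) / det
    b = (c_xz * c_yy - c_xy * c_yz) / det
    return a, b

def llse_update(c_xy, c_xz, c_yy, c_yz, c_zz):
    """Same coefficients, obtained as L[X|Y] + L[X | Z - L[Z|Y]] as in (10.2)."""
    g_xy = c_xy / c_yy                    # L[X|Y] = g_xy * Y
    g_zy = c_yz / c_yy                    # L[Z|Y] = g_zy * Y
    cov_x_innov = c_xz - g_zy * c_xy      # cov(X, Z - L[Z|Y])
    var_innov = c_zz - g_zy * c_yz        # var(Z - L[Z|Y])
    b = cov_x_innov / var_innov           # coefficient of the innovation
    a = g_xy - b * g_zy                   # collect the terms multiplying Y
    return a, b

# Assumed covariances: cov(X,Y) = cov(X,Z) = 1, var(Y) = var(Z) = 2, cov(Y,Z) = 1.
stats = (F(1), F(1), F(2), F(1), F(2))
print(llse_direct(*stats), llse_update(*stats))
```

Both computations give $a = b = 1/3$ for these values. The second one only needs the old estimate $L[X|Y]$ and second-order statistics of the innovation, which is the step the Kalman filter repeats at each time.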


10.2 Derivation of Kalman Filter 




We derive the equations for the Kalman filter, as stated in Theorem 9.8. For
convenience, we repeat those equations here:

$$\hat{X}(n) = A\hat{X}(n-1) + K_n[Y(n) - CA\hat{X}(n-1)] \qquad (10.16)$$
$$K_n = S_nC'[CS_nC' + \Sigma_W]^{-1} \qquad (10.17)$$
$$S_n = A\Sigma_{n-1}A' + \Sigma_V \qquad (10.18)$$
$$\Sigma_n = (I - K_nC)S_n \qquad (10.19)$$

and

$$S_n = \mathrm{cov}(X(n) - A\hat{X}(n-1)) \text{ and } \Sigma_n = \mathrm{cov}(X(n) - \hat{X}(n)). \qquad (10.20)$$

In the algebra, we repeatedly use the fact that

$$\mathrm{cov}(BV, DW) = B\,\mathrm{cov}(V, W)D'$$

and also that if $V$ and $W$ are orthogonal, then

$$\mathrm{cov}(V + W) = \mathrm{cov}(V) + \mathrm{cov}(W).$$

The algebra is a bit tedious, but the key steps are worth noting.
Let

$$Y^n = (Y(0), \ldots, Y(n)).$$

Note that

$$L[X(n)|Y^{n-1}] = L[AX(n-1) + V(n-1)|Y^{n-1}] = A\hat{X}(n-1).$$

Hence,

$$L[Y(n)|Y^{n-1}] = L[CX(n) + W(n)|Y^{n-1}] = CL[X(n)|Y^{n-1}] = CA\hat{X}(n-1),$$

so that

$$Y(n) - L[Y(n)|Y^{n-1}] = Y(n) - CA\hat{X}(n-1).$$

Thus, by Theorem 10.2,

$$\hat{X}(n) = L[X(n)|Y^n] = L[X(n)|Y^{n-1}] + L[X(n)|Y(n) - L[Y(n)|Y^{n-1}]]$$
$$= A\hat{X}(n-1) + K_n[Y(n) - CA\hat{X}(n-1)].$$

This derivation shows that (10.16) is a fairly direct consequence of the formula in
Theorem 10.2 for updating the LLSE.

The calculation of the gain $K_n$ is a bit more complex. Let

$$\tilde{Y}(n) = Y(n) - L[Y(n)|Y^{n-1}] = Y(n) - CA\hat{X}(n-1).$$

Then

$$K_n = \mathrm{cov}(X(n), \tilde{Y}(n))\,\mathrm{cov}(\tilde{Y}(n))^{-1}.$$

Now,

$$\mathrm{cov}(X(n), \tilde{Y}(n)) = \mathrm{cov}(X(n) - L[X(n)|Y^{n-1}], \tilde{Y}(n))$$

because $\tilde{Y}(n)$ is orthogonal to $Y^{n-1}$. Also,

$$\mathrm{cov}(X(n) - L[X(n)|Y^{n-1}], \tilde{Y}(n))$$
$$= \mathrm{cov}(X(n) - A\hat{X}(n-1), Y(n) - CA\hat{X}(n-1))$$
$$= \mathrm{cov}(X(n) - A\hat{X}(n-1), CX(n) + W(n) - CA\hat{X}(n-1))$$
$$= S_nC',$$

by (10.20).
To calculate $\mathrm{cov}(\tilde{Y}(n))$, we note that

$$\mathrm{cov}(\tilde{Y}(n)) = \mathrm{cov}(CX(n) + W(n) - CL[X(n)|Y^{n-1}]) = CS_nC' + \Sigma_W.$$

Thus,

$$K_n = S_nC'[CS_nC' + \Sigma_W]^{-1}.$$

To show (10.18), we note that

$$S_n = \mathrm{cov}(X(n) - L[X(n)|Y^{n-1}])$$
$$= \mathrm{cov}(AX(n-1) + V(n-1) - A\hat{X}(n-1))$$
$$= A\Sigma_{n-1}A' + \Sigma_V.$$

Finally, to derive (10.19), we calculate

$$\Sigma_n = \mathrm{cov}(X(n) - \hat{X}(n)).$$

We observe that

$$X(n) - L[X(n)|Y^n] = X(n) - A\hat{X}(n-1) - K_n[Y(n) - CA\hat{X}(n-1)]$$
$$= X(n) - A\hat{X}(n-1) - K_n[CX(n) + W(n) - CA\hat{X}(n-1)]$$
$$= [I - K_nC][X(n) - A\hat{X}(n-1)] - K_nW(n),$$

so that

$$\Sigma_n = [I - K_nC]S_n[I - K_nC]' + K_n\Sigma_WK_n'$$
$$= S_n - 2K_nCS_n + K_n[CS_nC' + \Sigma_W]K_n'$$
$$= S_n - 2K_nCS_n + K_n[CS_nC' + \Sigma_W][CS_nC' + \Sigma_W]^{-1}CS_n, \text{ by (10.17)}$$
$$= S_n - K_nCS_n,$$

as we wanted to show.


10.3 Properties of Kalman Filter 


The goal of this section is to explain and justify the following result. The terms 
observable and reachable are defined after the statement of the theorem. 


Theorem 10.3 (Properties of the Kalman Filter)
(a) If $(A, C)$ is observable, then $\Sigma_n$ is bounded. Moreover, if $\Sigma_0 = 0$, then

$$\Sigma_n \to \Sigma \text{ and } K_n \to K, \qquad (10.37)$$

where $\Sigma$ is a finite matrix.
(b) Also, if in addition, $(A, Q)$ is reachable (with $\Sigma_V = QQ'$), then the filter with
$K_n = K$ is such that the covariance of the error also converges to $\Sigma$.
□


We explain these properties in the subsequent sections. Let us first make a few 
comments. 




* For some systems, the errors grow without bound. For instance, if one does not
observe anything (e.g., $C = 0$) and if the system is unstable (e.g., $X(n) =
2X(n-1) + V(n)$), then $\Sigma_n$ goes to infinity. However, (a) says that "if the
observations are rich enough," this does not happen: one can track $X(n)$ with an
error that has a bounded covariance.

* Part (b) of the theorem says that in some cases, one can use the filter with 
a constant gain K without having a bigger error, asymptotically. This is very 
convenient as one does not have to compute a new gain at each step. 


10.3.1 Observability 


Are the observations good enough to track the state with a bounded error covari- 
ance? Before stating the result, we need a precise notion of good observations. 


Definition 10.1 (Observability) We say that $(A, C)$ is observable if the null space
of

$$\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{d-1} \end{bmatrix}$$

is $\{0\}$. Here, $d$ is the dimension of $X(n)$. A matrix $M$ has null space $\{0\}$ if $0$ is the
only vector $v$ such that $Mv = 0$.
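Definition 10.1 amounts to a rank condition: the stacked matrix has null space $\{0\}$ exactly when its rank equals $d$. A minimal sketch of this test in plain Python follows; the helper names and the two example pairs $(A, C)$ are assumptions made for illustration, and exact fractions are used to avoid rounding issues in the elimination.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rank(M):
    """Rank of M by Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def observability_matrix(A, C):
    """Stack C, CA, ..., CA^{d-1}, where d is the dimension of the state."""
    blocks, block = [], C
    for _ in range(len(A)):
        blocks.extend(block)          # append the rows of C A^k
        block = mat_mul(block, A)
    return blocks

def is_observable(A, C):
    return rank(observability_matrix(A, C)) == len(A)

A = [[1, 1], [0, 1]]                  # position-plus-drift dynamics
print(is_observable(A, [[1, 0]]))     # observing the position
print(is_observable(A, [[0, 1]]))     # observing only the drift
```

Observing the position makes the pair observable, since successive positions reveal the drift. Observing only the drift does not, because the position never influences the output.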


The key result is the following. 
Lemma 10.4 (Observability Implies Bounded Error Covariance)

(a) If the system is observable, then $\Sigma_n$ is bounded.
(b) If in addition, $\Sigma_0 = 0$, then $\Sigma_n$ converges to some finite $\Sigma$.

Proof

(a) Observability implies that there is only one $X(0)$ that corresponds to
$(Y(0), \ldots, Y(d))$ if the system has no noise. Indeed, in that case,

$$X(n) = AX(n-1) \text{ and } Y(n) = CX(n).$$

Then,

$$X(1) = AX(0), X(2) = A^2X(0), \ldots, X(d) = A^dX(0),$$

so that

$$Y(0) = CX(0), Y(1) = CAX(0), \ldots, Y(d-1) = CA^{d-1}X(0).$$

Consequently,

$$\begin{bmatrix} Y(0) \\ Y(1) \\ \vdots \\ Y(d-1) \end{bmatrix} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{d-1} \end{bmatrix} X(0).$$

Now, imagine that there are two different initial states, say $X(0)$ and $\bar{X}(0)$, that
give the same outputs $Y(0), \ldots, Y(d)$. Then, in particular,

$$\begin{bmatrix} Y(0) \\ Y(1) \\ \vdots \\ Y(d-1) \end{bmatrix} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{d-1} \end{bmatrix} X(0) = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{d-1} \end{bmatrix} \bar{X}(0),$$

so that

$$\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{d-1} \end{bmatrix} (X(0) - \bar{X}(0)) = 0.$$

The observability property implies that $X(0) - \bar{X}(0) = 0$.

Thus, if $(A, C)$ is observable, one can identify the initial condition $X(0)$
uniquely after $d + 1$ observations of the output, when there is no noise. Hence,
when there is no noise, one can then determine $X(1), X(2), \ldots$ exactly. Thus,
when $(A, C)$ is observable, one can determine the state $X(n)$ precisely from the
outputs.

However, our system has some noise. If $(A, C)$ is observable, we are
able to identify $X(0)$ from $Y(0), \ldots, Y(d)$, up to some linear function of
the noise that has affected those outputs, i.e., up to a linear function of
$\{V(0), \ldots, V(d-1), W(0), \ldots, W(d)\}$. Consequently, we can determine
$X(d)$ from $Y(0), \ldots, Y(d)$, up to some linear function of
$\{V(0), \ldots, V(d-1), W(0), \ldots, W(d)\}$. Similarly, we can determine $X(n)$ from
$Y(n-d), \ldots, Y(n)$, up to some linear function of
$\{V(n-d), \ldots, V(n-1), W(n-d), \ldots, W(n)\}$.

This implies that the error between $X(n)$ and $\hat{X}(n)$ is a linear combination of
a bounded number of noise contributions, so that $\Sigma_n$ is bounded.

(b) One can show that if $\Sigma_0 = 0$, i.e., if we know $X(0)$, then $\Sigma_n$ increases in the
sense that $\Sigma_n - \Sigma_{n-1}$ is nonnegative definite. Being bounded and increasing
implies that $\Sigma_n$ converges, and so does $K_n$.
□


10.3.2 Reachability 
Assume that $\Sigma_V = QQ'$. We say that $(A, Q)$ is reachable if the rank of

$$[Q, AQ, \ldots, A^{d-1}Q]$$

is full. To appreciate the meaning of this property, note that we can write the state
equations as

$$X(n) = AX(n-1) + Q\eta_n,$$

where $\mathrm{cov}(\eta_n) = I$. That is, the components of $\eta$ are orthogonal. In the Gaussian
case, the components of $\eta$ are $N(0, 1)$ and independent. If $(A, Q)$ is reachable, this
means that for any $x \in \Re^d$, there is some sequence $\eta_1, \ldots, \eta_d$ such that if $X(0) = 0$,
then $X(d) = x$. Indeed,

$$X(d) = \sum_{k=0}^{d-1} A^kQ\eta_{d-k} = [Q, AQ, \ldots, A^{d-1}Q] \begin{bmatrix} \eta_d \\ \eta_{d-1} \\ \vdots \\ \eta_1 \end{bmatrix}.$$

Since the matrix is full rank, the span of its columns is $\Re^d$, which means precisely
that there is a linear combination of these columns that is equal to any given vector
in $\Re^d$.

The proof of part (b) of the theorem is a bit too involved for this course. 


10.4 Extended Kalman Filter 


The Kalman filter is often used for nonlinear systems. The idea is that if the system 
is almost linear over a few steps, then one may be able to use the Kalman filter 
locally and change the matrices A and C as the estimate of the state changes. 

The model is as follows: 


$$X(n+1) = f(X(n)) + V(n)$$
$$Y(n+1) = g(X(n+1)) + W(n+1).$$

The extended Kalman filter is then

$$\hat{X}(n+1) = f(\hat{X}(n)) + K_n[Y(n+1) - g(f(\hat{X}(n)))]$$
$$K_n = S_nC_n'[C_nS_nC_n' + \Sigma_W]^{-1}$$
$$S_n = A_n\Sigma_nA_n' + \Sigma_V$$
$$\Sigma_{n+1} = [I - K_nC_n]S_n,$$

where

$$[A_n]_{ij} = \frac{\partial}{\partial x_j}f_i(\hat{X}(n)) \text{ and } [C_n]_{ij} = \frac{\partial}{\partial x_j}g_i(\hat{X}(n)).$$


Thus, the idea is to linearize the system around the estimated state value and then 
apply the usual Kalman filter. 

Note that we are now in the realm of heuristics and that very little can be said 
about the properties of this filter. Experiments show that it works well when the 
nonlinearities are small, whatever this means precisely, but that it may fail miserably 
in other conditions. 
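For a scalar state, one step of the extended Kalman filter can be sketched as follows. The function name `ekf_step`, the mildly nonlinear map `f`, and the noise variances are assumptions made for this illustration. The sketch takes both slopes at the current estimate; evaluating the slope of $g$ at the predicted state $f(\hat{X}(n))$ instead is a common variant.

```python
import math

def ekf_step(x_hat, sigma, y_next, f, g, df, dg, var_v, var_w):
    """One scalar extended Kalman filter update.

    f, g are the state and observation maps; df, dg are their derivatives.
    """
    a = df(x_hat)                      # A_n: local slope of f
    c = dg(x_hat)                      # C_n: local slope of g
    s = a * sigma * a + var_v          # S_n
    k = s * c / (c * s * c + var_w)    # K_n
    pred = f(x_hat)                    # predicted state f(x_hat(n))
    x_new = pred + k * (y_next - g(pred))
    sigma_new = (1 - k * c) * s        # Sigma_{n+1}
    return x_new, sigma_new

# Hypothetical system: a slightly nonlinear state map, observed directly.
def f(x):
    return x + 0.1 * math.sin(x)

def df(x):
    return 1.0 + 0.1 * math.cos(x)

def g(x):
    return x

def dg(x):
    return 1.0

x_hat, sigma = 0.0, 1.0
for y in [0.3, 0.5, 0.4]:              # made-up observations
    x_hat, sigma = ekf_step(x_hat, sigma, y, f, g, df, dg, 0.04, 0.09)
print(x_hat, sigma)
```

When `f` and `g` are linear, the slopes are constant and the step reduces to the Kalman filter update of Theorem 9.8, which gives a simple sanity check for an implementation.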


10.4.1 Examples 


Tracking a Vehicle 

In this example, borrowed from "Eric Feron, Notes for AE6531, Georgia Tech", the
goal is to track a vehicle that moves in the plane by using noisy measurements of
distances to 9 points $p_i \in \Re^2$. Let $p(n) \in \Re^2$ be the position of the vehicle and
$u(n) \in \Re^2$ be its velocity at time $n \ge 0$.

We assume that the velocity changes according to a known rule, except for some
random perturbation. Specifically, we assume that

$$p(n+1) = p(n) + 0.1u(n) \qquad (10.38)$$
u(n 4 1)— kr a u(n) + w(n), (10.39) 


where the $w(n)$ are i.i.d. $N(0, I)$. The measurements are

$$y_i(n) = \|p(n) - p_i\| + v_i(n), \quad i = 1, 2, \ldots, 9,$$

where the $v_i(n)$ are i.i.d. $N(0, 0.3^2)$.

Figure 10.2 shows the result of the extended Kalman filter for $X(n) =
(p(n), u(n))$ initialized with $\hat{x}(0) = 0$ and $\Sigma_0 = I$.

Fig. 10.2 The Extended Kalman Filter for the system (10.38)-(10.39)

Fig. 10.3 The chemical reactions

Tracking a Chemical Reaction 

This example concerns estimating the state of a chemical reactor from measure- 
ments of the pressure. This example is borrowed from James B. Rawlings and 
Fernando V. Lima, U. Wisconsin, Madison. There are three components A, B, C 
in the reactions and they are modeled as shown in Fig. 10.3 where the k; are the 
kinetic constants. 


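Let $C_A$, $C_B$, $C_C$ be the concentrations of $A$, $B$, $C$, respectively. The model is

$$\frac{d}{dt}\begin{bmatrix} C_A \\ C_B \\ C_C \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 1 & -2 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} k_1C_A - k_{-1}C_BC_C \\ k_2C_B^2 - k_{-2}C_C \end{bmatrix}$$

and

$$y = RT(C_A + C_B + C_C).$$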


As shown in the top part of Fig. 10.4, this filter does not track the concentrations 
correctly. In fact, some concentrations that the filter estimates are negative!

Fig. 10.4 The top two graphs show that the extended Kalman filter does not track the
concentrations correctly. The bottom two graphs show convergence after modifying the equations

The bottom graphs show that the estimates converge after modifying the equations
and replacing negative estimates by 0.

The point of this example is that the extended Kalman filter is not guaranteed to 
converge and that, sometimes, a simple modification makes it converge. 


10.5 Summary 


* Updating LLSE; 

* Derivation of Kalman Filter; 

* Observability and Reachability;
* Extended Kalman Filter.

10.5.1 Key Equations and Formulas

Updating LLSE             zero-mean ⇒ $L[X|Y, Z] = L[X|Y] + L[X|Z - L[Z|Y]]$            T.10.2
Observability             ⇒ bounded error covariance                                   L.10.4
Observability + Reachability   ⇒ asymptotic filter is good enough                      T.10.3
Extended Kalman Filter    Linearize equations                                          S.10.4


10.6 References 


The book Goodwin and Sin (2009) surveys filtering and applications to control. The
textbook Kumar and Varaiya (1986) is a comprehensive yet accessible presentation
of control theory, filtering, and adaptive control. It is available online.

Application: Recognizing Speech
Topics: Hidden Markov chain, Viterbi decoding, EM Algorithms


11.1 Learning: Concepts and Examples 


In artificial intelligence, “learning” refers to the process of discovering the relation- 
ship between related items, for instance between spoken words and sounds heard 
(Fig. 11.1). 

As a simple example, consider the binary symmetric channel example of Problem
7.5 in Chap. 7. The inputs $X_n$ are i.i.d. $B(p)$ and, given the inputs, the output $Y_n$ is
equal to $X_n$ with probability $1 - \epsilon$, for $n \ge 0$. In this example, there is a probabilistic
relationship between the inputs and the outputs described by $\epsilon$. Learning here refers
to estimating $\epsilon$.

There are two basic situations. In supervised learning, one observes the inputs
$\{X_n, n = 0, \ldots, N\}$ and the outputs $\{Y_n, n = 0, \ldots, N\}$. One can think of this form
of learning as a training phase for the system. Thus, one observes the channel with
a set of known input values. Once one has "learned" the channel, i.e., estimated $\epsilon$,
one can then design the best receiver and use it on unknown inputs. In unsupervised
learning, one observes only the outputs. The benefit of this form of learning is that
it takes place while the system is operational and one does not "waste" time with
a training phase. Also, the system can adapt automatically to slow changes of $\epsilon$
without having to re-train it with a new training phase.
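In the supervised case of this example, a natural estimate of $\epsilon$ is the fraction of training positions where the output differs from the input, which is also the maximum likelihood estimate for this channel. A minimal sketch follows; the function name and the sample sequences are made up for illustration.

```python
def estimate_epsilon(inputs, outputs):
    """Fraction of positions where the channel output differs from the input."""
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("need two nonempty sequences of equal length")
    flips = sum(x != y for x, y in zip(inputs, outputs))
    return flips / len(inputs)

# Made-up training data: known inputs and the corresponding observed outputs.
xs = [0, 1, 1, 0, 1, 0, 0, 1]
ys = [0, 1, 0, 0, 1, 0, 1, 1]
print(estimate_epsilon(xs, ys))
```

The unsupervised case is harder because the inputs are not observed, so the flips cannot be counted directly.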

As you can expect, there is a trade-off when choosing supervised versus
unsupervised learning. A training phase takes time but the learning is faster than
in unsupervised learning. The best method to use depends on characteristics of the
practical situation, such as the likely rate of change of the system parameters.

Fig. 11.1 Can you hear me?


11.2 Hidden Markov Chain 


A hidden Markov chain is a Markov chain together with a state observation model.
The Markov chain is $\{X(n), n \ge 0\}$ and it has its transition matrix $P$ on the state
space $\mathcal{X}$ and its initial distribution $\pi_0$. The state observation model specifies that
when the state of the Markov chain is $x$, one observes a value $y$ with probability
$Q(x, y)$, for $y \in \mathcal{Y}$. More precisely, here is the definition (Fig. 11.2).

Definition 11.1 (Hidden Markov Chain) A hidden Markov chain is a random
sequence $\{(X(n), Y(n)), n \ge 0\}$ such that $X(n) \in \mathcal{X} = \{1, \ldots, N\}$ and $Y(n) \in
\mathcal{Y} = \{1, \ldots, M\}$ and

$$P(X(0) = x_0, Y(0) = y_0, \ldots, X(n) = x_n, Y(n) = y_n)$$
$$= \pi_0(x_0)Q(x_0, y_0)P(x_0, x_1)Q(x_1, y_1) \times \cdots \times P(x_{n-1}, x_n)Q(x_n, y_n),$$
$$\text{for all } n \ge 0, x_m \in \mathcal{X}, y_m \in \mathcal{Y}. \qquad (11.1)$$
□
In the speech recognition application, the $X_n$ are "parts of speech," i.e., segments
of sentences, and the $Y_n$ are sounds. The structure of the language determines
relationships between the $X_n$ that can be approximated by a Markov chain. The
relationship between $X_n$ and $Y_n$ is speaker-dependent.

The recognition problem is the following. Assume that you have observed that
$Y^n := (Y_0, \ldots, Y_n) = y^n := (y_0, \ldots, y_n)$. What is the most likely sequence $X^n :=
(X_0, \ldots, X_n)$? That is, in the terminology of Chap. 7, we want to compute

$$\mathrm{MAP}[X^n | Y^n = y^n].$$

Thus, we want to find the sequence $x^n \in \mathcal{X}^{n+1}$ that maximizes

$$P[X^n = x^n | Y^n = y^n].$$

Fig. 11.2 The hidden Markov chain

Note that

$$P[X^n = x^n | Y^n = y^n] = \frac{P(X^n = x^n, Y^n = y^n)}{P(Y^n = y^n)}.$$

The MAP is the value of $x^n$ that maximizes the numerator. Now, by (11.1), the
logarithm of the numerator is equal to

$$\log(\pi_0(x_0)Q(x_0, y_0)) + \sum_{m=1}^{n} \log(P(x_{m-1}, x_m)Q(x_m, y_m)).$$

Define

$$d_0(x_0) = -\log(\pi_0(x_0)Q(x_0, y_0))$$

and

$$d_m(x_{m-1}, x_m) = -\log(P(x_{m-1}, x_m)Q(x_m, y_m)).$$

Then, the MAP is the sequence $x^n$ that minimizes

$$d_0(x_0) + \sum_{m=1}^{n} d_m(x_{m-1}, x_m). \qquad (11.2)$$

The expression (11.2) can be viewed as the length for a path in the graph shown 
in Fig. 11.3. Finding the MAP is then equivalent to solving a shortest path problem. 
There are a few standard algorithms for solving such problems. We describe the 
Bellman—Ford Algorithm due to Bellman (Fig. 11.4) and Ford. 

For $m = 0, \ldots, n$ and $x \in \mathcal{X}$, let $V_m(x)$ be the length of the shortest path from
$X(m) = x$ to the column $X(n)$ in the graph. In particular, $V_n(x) = 0$ for all $x \in \mathcal{X}$.
Then, one has

$$V_m(x) = \min_{x' \in \mathcal{X}} \{d_{m+1}(x, x') + V_{m+1}(x')\}, \quad x \in \mathcal{X}, \ m = 0, \ldots, n-1. \tag{11.3}$$

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Fig. 11.3 The MAP as a shortest path 


Fig. 11.4 Richard Bellman, 
1920-1984 


Finally, let

$$V = \min_{x \in \mathcal{X}} \{d_0(x) + V_0(x)\}. \tag{11.4}$$


Then, V is the minimum value of expression (11.2). 
The algorithm is then as follows: 


Step (1): Calculate $\{V_m(x), x \in \mathcal{X}\}$ recursively for $m = n-1, n-2, \ldots, 0$, using
(11.3). At each step, note the arc out of each $x$ that achieves the minimum. Say
that the arc out of $x_m = x$ goes to $x_{m+1} = s(m, x)$ for $x \in \mathcal{X}$.

Step (2): Find the value $x_0$ that achieves the minimum in (11.4).

Step (3): The MAP is then the sequence

$$x_0, \ x_1 = s(0, x_0), \ x_2 = s(1, x_1), \ \ldots, \ x_n = s(n-1, x_{n-1}).$$


Equations (11.3) are the Bellman-Ford Equations. They are a particular version
of Dynamic Programming Equations (DPE) for the shortest path problem. 

Note that the essential idea was to define the length of the shortest remaining path 
starting from every node in the graph and to write recursive expressions for those 
quantities. Thus, one solves the DPE backwards and then one finds the shortest path 
forward. This application of the shortest path algorithm for finding a MAP is called 
the Viterbi Algorithm due to Andrew Viterbi (Fig. 11.5). 
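
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than code from the book: the function name `viterbi`, the 0-based labels for states and observations, and the list-of-lists layout of `pi0`, `P`, and `Q` are assumptions made for the example. Arcs with zero probability are given infinite length.

```python
import math

def viterbi(pi0, P, Q, y):
    """Return a MAP state sequence x_0..x_n given observations y_0..y_n.

    pi0[x], P[x][x2], Q[x][obs] hold the initial, transition and
    observation probabilities (hypothetical layout, 0-based labels).
    """
    n = len(y) - 1
    N = len(pi0)

    def neg_log(p):
        return math.inf if p == 0 else -math.log(p)

    def d(m, x, x2):
        # Arc length d_m(x, x2) = -log(P(x, x2) Q(x2, y_m)), for m >= 1.
        return neg_log(P[x][x2] * Q[x2][y[m]])

    # V[m][x]: length of the shortest path from X(m) = x to column n.
    V = [[0.0] * N for _ in range(n + 1)]
    # s[m][x]: successor of x on a shortest path out of column m.
    s = [[0] * N for _ in range(n)]

    # Step (1): backward recursion (11.3).
    for m in range(n - 1, -1, -1):
        for x in range(N):
            costs = [d(m + 1, x, x2) + V[m + 1][x2] for x2 in range(N)]
            best = min(range(N), key=costs.__getitem__)
            s[m][x] = best
            V[m][x] = costs[best]

    # Step (2): best initial state, as in (11.4).
    d0 = [neg_log(pi0[x] * Q[x][y[0]]) for x in range(N)]
    x0 = min(range(N), key=lambda x: d0[x] + V[0][x])

    # Step (3): follow the recorded arcs forward.
    path = [x0]
    for m in range(n):
        path.append(s[m][path[-1]])
    return path

# Illustrative parameters (not from the text): a "sticky" two-state chain
# observed with 20% errors.
pi0 = [0.5, 0.5]
P = [[0.9, 0.1], [0.1, 0.9]]
Q = [[0.8, 0.2], [0.2, 0.8]]
map_path = viterbi(pi0, P, Q, [0, 0, 1, 0, 0])
```

With these illustrative numbers, the isolated observation 1 is explained more cheaply as an observation error than as two state changes, so `map_path` stays in state 0 throughout.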
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Fig. 11.5 Andrew Viterbi, b. 
1934 


11.3 Expectation Maximization and Clustering 


Expectation maximization is a class of algorithms to estimate parameters of 
distributions. We first explain these algorithms on a simple clustering problem. We 
apply expectation maximization to the HMC model in the next section. 

The clustering problem consists in grouping sample points into clusters of 
"similar" values. We explain a simple instance of this problem and we discuss the 
expectation maximization algorithm. 


11.3.1 A Simple Clustering Problem 


You look at a set of $N$ exam results $\{X(1), \ldots, X(N)\}$ in your probability course
and you must decide who are the A and the B students. To study this problem, we
assume that the results of A students are i.i.d. $\mathcal{N}(a, \sigma^2)$ and those of B students are
i.i.d. $\mathcal{N}(b, \sigma^2)$, where $a > b$.

For simplicity, assume that we know $\sigma^2$ and that each student has probability 0.5
of being an A student. However, we do not know the parameters $(a, b)$.

(The same method applies when one does not know the variances of the scores 
of A and B students, nor the prior probability that a student is of type A.) 

One heuristic is as follows (see Fig. 11.6). Start with a guess $(a_1, b_1)$ for $(a, b)$.
Student $n$ with score $X(n)$ is more likely to be of type A if $X(n) > (a_1 + b_1)/2$.
Let us declare that such students are of type A and the others are of type B. Let then
$a_2$ be the average score of the students declared to be of type A and $b_2$ that of the
other students. We repeat the procedure after replacing $(a_1, b_1)$ by $(a_2, b_2)$ and we
keep doing this until the values seem to converge. This heuristic is called the hard
expectation maximization algorithm.
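
The hard EM iteration just described can be sketched as follows. The function name `hard_em`, the stopping rule, and the synthetic scores (means 80 and 60, standard deviation 5) are assumptions chosen for illustration; they do not come from the text.

```python
import random

def hard_em(scores, a, b, max_iterations=100):
    """Hard EM for two equally likely clusters with a common variance."""
    for _ in range(max_iterations):
        threshold = (a + b) / 2
        # Declare students above the midpoint to be of type A.
        type_a = [x for x in scores if x > threshold]
        type_b = [x for x in scores if x <= threshold]
        if not type_a or not type_b:
            break  # degenerate split: keep the current guess
        new_a = sum(type_a) / len(type_a)
        new_b = sum(type_b) / len(type_b)
        if (new_a, new_b) == (a, b):
            break  # the split no longer changes
        a, b = new_a, new_b
    return a, b

# Synthetic data (assumed values): 1000 A students and 1000 B students.
rng = random.Random(0)
scores = ([rng.gauss(80, 5) for _ in range(1000)]
          + [rng.gauss(60, 5) for _ in range(1000)])
a_hat, b_hat = hard_em(scores, a=75.0, b=50.0)
```

Because every score is assigned entirely to one cluster, the estimates are slightly biased by the misclassified students near the threshold; with well-separated clusters this effect is small.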

A slightly different heuristic is as follows (see Fig. 11.7). Again, we start with a
guess $(a_1, b_1)$.

Using Bayes’ rule, we calculate the probability $p(n)$ that student $n$ with score
$X(n)$ is of type A. We then calculate

$$a_2 = \frac{\sum_n X(n) p(n)}{\sum_n p(n)} \quad \text{and} \quad b_2 = \frac{\sum_n X(n)(1 - p(n))}{\sum_n (1 - p(n))}.$$
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Fig. 11.6 Clustering with 
hard EM. The initial guess is $(a_1, b_1)$, which leads to the MAP of the types and the
next guess $(a_2, b_2)$, and so on


Fig. 11.7 Clustering with soft EM. The initial guess is $(a_1, b_1)$, which leads to the
probabilities of the types and the next guess $(a_2, b_2)$, and so on


We then repeat after replacing $(a_1, b_1)$ by $(a_2, b_2)$. Thus, the calculation of $a_2$
weighs the scores of the students by the likelihood that they are of type A, and
similarly for the calculation of $b_2$.

This heuristic is called the soft expectation maximization algorithm. 
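
A sketch of the soft EM iteration, under the same assumptions as in the text (known common variance, prior probability 0.5 for each type). The names `posterior_a` and `soft_em`, the fixed number of iterations, and the synthetic scores are illustrative choices. With equal priors, Bayes' rule gives $p(n) = 1/(1 + \exp\{[(X(n) - a)^2 - (X(n) - b)^2]/(2\sigma^2)\})$.

```python
import math
import random

def posterior_a(x, a, b, sigma):
    """P[type A | score x] for equal priors and common variance sigma^2."""
    t = ((x - a) ** 2 - (x - b) ** 2) / (2 * sigma ** 2)
    if t > 700:          # exp(t) would overflow; the posterior is essentially 0
        return 0.0
    return 1.0 / (1.0 + math.exp(t))

def soft_em(scores, a, b, sigma, iterations=100):
    """Soft EM: reweigh every score by the probability of each type."""
    for _ in range(iterations):
        p = [posterior_a(x, a, b, sigma) for x in scores]
        weight_a = sum(p)
        weight_b = len(scores) - weight_a
        a = sum(x * w for x, w in zip(scores, p)) / weight_a
        b = sum(x * (1 - w) for x, w in zip(scores, p)) / weight_b
    return a, b

# Synthetic data (assumed values): means 80 and 60, standard deviation 5.
rng = random.Random(0)
scores = ([rng.gauss(80, 5) for _ in range(1000)]
          + [rng.gauss(60, 5) for _ in range(1000)])
a_soft, b_soft = soft_em(scores, a=75.0, b=50.0, sigma=5.0)
```

The sketch does not guard against a cluster receiving zero total weight, which could happen with a very poor initial guess.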


11.3.2 A Second Look 


In the previous example, one attempts to estimate some parameter $\theta = (a, b)$ based
on some observations $X = (X_1, \ldots, X_N)$. Let $Z = (Z_1, \ldots, Z_N)$ where $Z_n = A$ if
student $n$ is of type A and $Z_n = B$ otherwise.

We would like to maximize $f[x \mid \theta]$ over $\theta$, to find $\mathrm{MLE}[\theta \mid X = x]$. One has

$$f[x \mid \theta] = \sum_z f[x \mid z, \theta] P[z \mid \theta],$$

where the sum is over the $2^N$ possible values of $Z$. This is computationally too
difficult.
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Fig. 11.8 Hard and soft 
EM? 


Hard EM (Fig. 11.8) replaces the sum over $z$ by

$$f[x \mid z^*, \theta] P[z^* \mid \theta],$$

where $z^*$ is the most likely value of $Z$ given the observations and a current guess for
$\theta$. That is, if the current guess is $\theta_k$, then

$$z^* = \mathrm{MAP}[Z \mid X = x, \theta_k] = \arg\max_z P[Z = z \mid X = x, \theta_k].$$

The next guess is then

$$\theta_{k+1} = \arg\max_\theta f[x \mid z^*, \theta] P[z^* \mid \theta].$$


Soft EM makes a different approximation. First, it replaces

$$\log(f[x \mid \theta]) = \log\left(\sum_z f[x \mid z, \theta] P[z \mid \theta]\right)$$

by

$$\sum_z \log(f[x \mid z, \theta]) P[z \mid \theta].$$

That is, it replaces the logarithm of an expectation by the expectation of the
logarithm.
Second, it replaces the expression above by

$$\sum_z \log(f[x \mid z, \theta]) P[z \mid x, \theta_k]$$

and the new guess $\theta_{k+1}$ is the maximizer of that expression over $\theta$. Thus, it replaces
the distribution of $Z$ by the conditional distribution given the current guess and the
observations.
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If this heuristic did not work in practice, nobody would mention it. Surprisingly, it 
seems to work for some classes of problems. There is some theoretical justification 
for the heuristic. One can show that it converges to a local maximum of $f[x \mid \theta]$.
Generally, this is little comfort because most problems have many local maxima. 
See Roche (2012). 
11.4 Learning: Hidden Markov Chain 
Consider once again a hidden Markov chain model but assume that $(\pi, P, Q)$ are
functions of some parameter $\theta$ that we wish to estimate. We write this explicitly as
$(\pi_\theta, P_\theta, Q_\theta)$. We are interested in the value of $\theta$ that makes the observed sequence
$y^n$ most likely.

Recall that the MLE of $\theta$ given that $Y^n = y^n$ is defined as

$$\mathrm{MLE}[\theta \mid Y^n = y^n] = \arg\max_\theta P[Y^n = y^n \mid \theta].$$

As in the discussion of clustering, we have

$$P[Y^n = y^n \mid \theta] = \sum_{x^n} P[Y^n = y^n \mid X^n = x^n, \theta] P[X^n = x^n \mid \theta]. \tag{11.5}$$

11.4.1 HEM 
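The HEM algorithm replaces the sum over $x^n$ by

$$P[Y^n = y^n \mid X^n = x_*^n, \theta] P[X^n = x_*^n \mid \theta]$$

and then $P[X^n = x_*^n \mid \theta]$ by

$$P[X^n = x_*^n \mid Y^n, \theta_0],$$

where

$$x_*^n = \mathrm{MAP}[X^n \mid Y^n, \theta_0].$$

Recall that one can find $x_*^n$ by using Viterbi's algorithm. Also, by (11.1),

$$P[Y^n = y^n \mid X^n = x^n, \theta] P[X^n = x^n \mid \theta] = \pi_\theta(x_0) Q_\theta(x_0, y_0) P_\theta(x_0, x_1) Q_\theta(x_1, y_1) \times \cdots \times P_\theta(x_{n-1}, x_n) Q_\theta(x_n, y_n).$$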
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11.4.2 Training the Viterbi Algorithm 


The Viterbi algorithm requires knowing P and Q. In practice, Q depends on the 
speaker and P may depend on the local dialect. (Valley speech uses more “likes” 
than Berkeley speakers.) We explained that if a parametric model is available, then 
one can use HEM. 

Without a parametric model, a simple supervised training approach where one
knows both $x^n$ and $y^n$ is to estimate $P$ and $Q$ by using empirical frequencies. For
instance, the number of pairs $(x_m, x_{m+1})$ that are equal to $(a, b)$ in $x^n$ divided by
the number of times that $x_m = a$ provides an estimate of $P(a, b)$. The estimation of
$Q$ is similar.
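
The empirical-frequency estimates can be sketched as follows. The function name `estimate_P_Q` and the dictionary output are illustrative assumptions. For $P$, the sketch counts visits to $a$ only among $x_0, \ldots, x_{n-1}$ (the states that have a successor), so that each estimated row sums to one; this is one reasonable reading of "the number of times that $x_m = a$".

```python
from collections import Counter

def estimate_P_Q(x, y):
    """Estimate transition and observation probabilities from labeled data.

    x: sequence of hidden states, y: sequence of observations (same length).
    Returns dictionaries P[(a, b)] and Q[(a, obs)] of empirical frequencies.
    """
    transitions = Counter(zip(x[:-1], x[1:]))   # pairs (x_m, x_{m+1})
    departures = Counter(x[:-1])                # visits that have a successor
    emissions = Counter(zip(x, y))              # pairs (x_m, y_m)
    visits = Counter(x)
    P = {(a, b): count / departures[a] for (a, b), count in transitions.items()}
    Q = {(a, obs): count / visits[a] for (a, obs), count in emissions.items()}
    return P, Q

# Tiny illustrative training sequence.
P_hat, Q_hat = estimate_P_Q("aabab", "00101")
```

Pairs that never occur in the training data are simply absent from the dictionaries, which corresponds to an estimate of zero.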


11.5 Summary 


* Hidden Markov Chain;
* Viterbi Algorithm for $\mathrm{MAP}[X \mid Y]$;
* Clustering and Expectation Maximization;
* EM for HMC.


11.5.1 Key Equations and Formulas 


Definition of HMC        $X(n)$ MC & $P[Y_n \mid X_n]$                                           D.11.1
Bellman-Ford Equations   $V_n(x) = \min_y \{d(x, y) + V_{n+1}(y)\}$                              (11.3)
EM, Soft and Hard        $\theta \to z \to x$; heuristics to compute $\mathrm{MAP}[\theta \mid x]$   S.11.3


11.6 References 


The text Wainwright and Jordan (2008) is a great presentation of graphical models. It
covers expectation maximization and many other useful techniques. 
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11.7 Problems 


Problem 11.1 Let $(X_n, Y_n)$ be a hidden Markov chain. Let $Y^n = (Y_0, \ldots, Y_n)$ and
$X^n = (X_0, \ldots, X_n)$. The Viterbi algorithm computes


$\mathrm{MLE}[Y^n \mid X^n]$;
$\mathrm{MLE}[X^n \mid Y^n]$;
$\mathrm{MAP}[Y^n \mid X^n]$;
$\mathrm{MAP}[X^n \mid Y^n]$.


Problem 11.2 Assume that the Markov chain $X_n$ is such that $\mathcal{X} = \{a, b\}$, $\pi_0(a) =
\pi_0(b) = 0.5$ and $P(x, x') = \alpha$ for $x \ne x'$ and $P(x, x) = 1 - \alpha$. Assume also
that $X_n$ is observed through a BSC with error probability $\epsilon$, as shown in Fig. 11.9.
Implement the Viterbi algorithm and evaluate its performance.


Problem 11.3 Suppose that the grades of students in a class are distributed as a
mixture of two Gaussian distributions, $\mathcal{N}(\mu_1, \sigma_1^2)$ with probability $p$ and $\mathcal{N}(\mu_2, \sigma_2^2)$
with probability $1 - p$. All the parameters $\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, p)$ are unknown.


(a) You observe $n$ i.i.d. samples, $y_1, \ldots, y_n$, drawn from the mixed distribution. Find
$f(y_1, \ldots, y_n \mid \theta)$.

(b) Let the type random variable $X_i$ be 0 if $Y_i \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and 1 if $Y_i \sim
\mathcal{N}(\mu_2, \sigma_2^2)$. Find $\mathrm{MAP}[X_i \mid Y_i, \theta]$.

(c) Implement the Hard EM algorithm to approximately find $\mathrm{MLE}[\theta \mid Y_1, \ldots, Y_n]$. To
this end, use MATLAB to generate 1000 data points $(y_1, \ldots, y_{1000})$ according
to $\theta = (10, 4, 30, 6, 0.4)$. Use your data to estimate $\theta$. How well is your
algorithm working?


Fig. 11.9 A simple hidden Markov chain
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Topics: Stochastic Gradient, Matching Pursuit, Compressed Sensing, Recom- 
mendation Systems 


12.1 Online Linear Regression 


This section explains the stochastic gradient descent algorithm, which is a technique 
used in many learning schemes. 

Recall that a linear regression finds the parameters $a$ and $b$ that minimize the
error

$$\sum_{k=1}^{K} (X_k - a - bY_k)^2,$$

where the $(X_k, Y_k)$ are observed samples that are i.i.d. with some unknown
distribution $f_{X,Y}(x, y)$.

Assume that, instead of calculating the linear regression based on K samples, we 
keep updating the parameters (a, b) every time we observe a new sample. 

Our goal is to find $a$ and $b$ that minimize

$$E((X - a - bY)^2) = E(X^2) + a^2 + b^2 E(Y^2) - 2aE(X) - 2bE(XY) + 2abE(Y) =: h(a, b).$$
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One idea is to use a gradient descent algorithm to minimize h(a, b). Say that at 
step k of the algorithm, one has calculated (a(k), b(k)). The gradient algorithm 
would update (a(k), b(k)) in the direction opposite of the gradient, to make 
h(a(k), b(k)) decrease. That is, the algorithm would compute 


$$a(k+1) = a(k) - \alpha \frac{\partial}{\partial a} h(a(k), b(k))$$
$$b(k+1) = b(k) - \alpha \frac{\partial}{\partial b} h(a(k), b(k)),$$

where $\alpha$ is a small positive number that controls the step size. Thus,

$$a(k+1) = a(k) - \alpha[2a(k) - 2E(X) + 2b(k)E(Y)]$$
$$b(k+1) = b(k) - \alpha[2b(k)E(Y^2) - 2E(XY) + 2a(k)E(Y)].$$


However, we do not know the distributions and cannot compute the expected 
values. Instead, we replace the mean values by the values of the new samples. That 
is, we compute 


$$a(k+1) = a(k) - \alpha[2a(k) - 2X(k+1) + 2b(k)Y(k+1)]$$
$$b(k+1) = b(k) - \alpha[2b(k)Y^2(k+1) - 2X(k+1)Y(k+1) + 2a(k)Y(k+1)].$$


That is, instead of using the gradient algorithm we use a stochastic gradient 
algorithm where the gradient is replaced by a noisy version. The intuition is that, 
if the step size is small, the errors between the true gradient and its noisy version 
average out. 
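
The stochastic gradient updates above can be sketched as follows. The function name `online_regression` and the sample model are assumptions made for illustration: we take $Y = X + Z$ with $X$ and $Z$ independent zero-mean Gaussian, $E(X^2) = 1$ and $E(Z^2) = 0.3$, for which $L[X \mid Y] = Y/1.3$.

```python
import random

def online_regression(samples, alpha):
    """Update (a, b) once per sample (x, y) with a noisy gradient step."""
    a, b = 0.0, 0.0
    for x, y in samples:
        grad_a = 2 * a - 2 * x + 2 * b * y
        grad_b = 2 * b * y * y - 2 * x * y + 2 * a * y
        # Both updates use the old values a(k), b(k).
        a, b = a - alpha * grad_a, b - alpha * grad_b
    return a, b

rng = random.Random(0)

def draw_sample():
    # Assumed model: Y = X + Z with var(X) = 1 and var(Z) = 0.3.
    x = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 0.3 ** 0.5)
    return x, x + z

samples = [draw_sample() for _ in range(20000)]
a_hat, b_hat = online_regression(samples, alpha=0.002)
```

With a constant step size the estimates keep fluctuating around the LLSE coefficients $(0, 1/1.3)$; a smaller $\alpha$ reduces the fluctuations at the price of slower convergence.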

The top part of Fig. 12.1 shows the updates of this algorithm for the example
(9.4) with $\alpha = 0.002$, $E(X^2) = 1$, and $E(Z^2) = 0.3$. In this example, we know that
the LLSE is

$$L[X \mid Y] = a + bY = \frac{1}{1.3} Y = 0.77Y.$$

The figure shows that $(a_k, b_k)$ approaches $(0, 0.77)$.

The bottom part of Fig. 12.1 shows the coefficients for (9.5) with $\gamma = 0.05$, $\alpha =
1$, and $\beta = 6$. We see that $(a_k, b_k)$ approaches $(-1, 7)$, which are the values for the
LLSE.
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Fig. 12.1 The coefficients 
"learned" with a stochastic 
gradient algorithm for (9.4) 
(top) and (9.5) (bottom)


12.2 Theory of Stochastic Gradient Projection¹


In this section, we explain the theory of the stochastic gradient algorithm that 
we illustrated in the case of online regression. We start with a discussion of the 
deterministic gradient projection algorithm. 

Consider a smooth convex function on a convex set, such as a soup bowl. A 
standard algorithm to minimize that function, i.e., to find the bottom of the bowl, 
is the gradient projection algorithm. This algorithm is similar to going downhill by 
making smaller and smaller jumps along the steepest slope. The projection makes 
sure that one remains in the acceptable set. The step size of the algorithm decreases 
over time so that one does not keep on overshooting the minimum. 


¹ This algorithm is also called ‘stochastic gradient descent’.
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The stochastic gradient projection algorithm is similar except that one has access 
only to a noisy version of the gradient. As the step size gets small, the errors in 
the gradient tend to average out and the algorithm converges to the minimum of the 
function. 

We first review the gradient projection algorithm and then discuss the stochastic 
gradient projection algorithm. 


12.2.1 Gradient Projection 


Consider the problem of minimizing a convex differentiable function $f(x)$ on a
closed convex subset $\mathcal{C}$ of $\Re^d$. By definition, $\mathcal{C}$ is a convex set if

$$\theta x + (1 - \theta) y \in \mathcal{C}, \quad \forall x, y \in \mathcal{C} \text{ and } \theta \in (0, 1). \tag{12.1}$$

That is, $\mathcal{C}$ contains the line segment between any two of its points: there
are no holes or kinks in the set boundary (Fig. 12.2).

Also, recall that a function $f : \mathcal{C} \to \Re$ is a convex function if
(Fig. 12.3)

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y), \quad \forall x, y \in \mathcal{C} \text{ and } \theta \in (0, 1). \tag{12.2}$$

A standard algorithm is gradient projection (GP):

$$x_{n+1} = [x_n - \alpha_n \nabla f(x_n)]_{\mathcal{C}}, \quad \text{for } n \ge 0.$$


Fig. 12.2 A non-convex set 
(left) and a convex set (right) 


Fig. 12.3 A non-convex 
function (top) and a convex 
function (bottom)


Fig. 12.4 The gradient projection algorithm (12.4) and (12.5)


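Here,

$$\nabla f(x) = \left[ \frac{\partial}{\partial x_1} f(x), \ldots, \frac{\partial}{\partial x_d} f(x) \right]^T$$

is the gradient of $f(\cdot)$ at $x$ and $[y]_{\mathcal{C}}$ indicates the closest point to $y$ in $\mathcal{C}$, also called
the projection of $y$ onto $\mathcal{C}$. The constants $\alpha_n > 0$ are called the step sizes of the
algorithm.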

As a simple example, let $f(x) = 6(x - 0.2)^2$ for $x \in \mathcal{C} := [0, 1]$. The factor 6 is
there only to have big steps initially and show the necessity of projecting back into
the convex set. With $\alpha_n = 1/n$ and $x_0 = 0$, the algorithm is

$$x_{n+1} = \left[ x_n - \frac{12}{n}(x_n - 0.2) \right]_{\mathcal{C}}. \tag{12.3}$$

Equivalently,

$$y_{n+1} = x_n - \frac{12}{n}(x_n - 0.2) \tag{12.4}$$
$$x_{n+1} = \max\{0, \min\{1, y_{n+1}\}\} \tag{12.5}$$

with $y_0 = x_0$.


As Fig. 12.4 shows, when the step size is large, the update $y_{n+1}$ falls outside
the set $\mathcal{C}$ and it is projected back into that set. Eventually, the updates fall into the
set $\mathcal{C}$.
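
The iteration (12.4) and (12.5) is easy to reproduce. In this sketch, the function name `gradient_projection` is an assumption, and so is the indexing: since $\alpha_n = 1/n$ is undefined for $n = 0$, the first update uses $n = 1$.

```python
def gradient_projection(steps, x0=0.0):
    """Gradient projection for f(x) = 6 (x - 0.2)^2 on C = [0, 1]."""
    x = x0
    xs = [x]
    for n in range(1, steps + 1):
        y = x - (12.0 / n) * (x - 0.2)      # gradient step (12.4), alpha_n = 1/n
        x = max(0.0, min(1.0, y))           # projection onto [0, 1], as in (12.5)
        xs.append(x)
    return xs

xs = gradient_projection(50)
```

The first gradient step lands at 2.4 and is projected back to 1; the second lands at -3.8 and is projected back to 0. Once $n \ge 12$ the factor $12/n$ is at most 1, so the iterates stop overshooting and settle at the minimizer 0.2.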

There are many known sufficient conditions that guarantee that the algorithm
converges to the unique minimizer of $f(\cdot)$ on $\mathcal{C}$. Here is an example.

Theorem 12.1 Assume that $f(x)$ is convex and differentiable on the convex set $\mathcal{C}$
and such that

$$f(x) \text{ has a unique minimizer } x^* \text{ in } \mathcal{C}; \tag{12.6}$$
$$\|\nabla f(x)\|^2 \le K, \quad \forall x \in \mathcal{C}; \tag{12.7}$$
$$\sum_n \alpha_n = \infty \text{ and } \sum_n \alpha_n^2 < \infty. \tag{12.8}$$

Then

$$x_n \to x^* \text{ as } n \to \infty.$$


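Proof The idea of the proof is as follows. Let $d_n = \frac{1}{2}\|x_n - x^*\|^2$. Fix $\epsilon > 0$. One
shows that there is some $n_0(\epsilon)$ so that, when $n \ge n_0(\epsilon)$,

$$d_{n+1} \le d_n - \gamma_n, \quad \text{if } d_n \ge \epsilon \tag{12.9}$$
$$d_{n+1} \le 2\epsilon, \quad \text{if } d_n < \epsilon. \tag{12.10}$$

Moreover, in (12.9), $\gamma_n > 0$ and $\sum_n \gamma_n = \infty$.

It follows from (12.9) that, eventually, for some $n = n_1(\epsilon) \ge n_0(\epsilon)$, one has
$d_n < \epsilon$. But then, because of (12.9) and (12.10), $d_n \le 2\epsilon$ for all $n \ge n_1(\epsilon)$. Since
$\epsilon > 0$ is arbitrary, this proves that $x_n \to x^*$.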

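To show (12.9) and (12.10), we first claim that

$$d_{n+1} \le d_n + \alpha_n (x^* - x_n)^T \nabla f(x_n) + \frac{1}{2} \alpha_n^2 K. \tag{12.11}$$

To see this, note that

$$d_{n+1} = \frac{1}{2} \|[x_n - \alpha_n \nabla f(x_n)]_{\mathcal{C}} - x^*\|^2$$
$$\le \frac{1}{2} \|x_n - \alpha_n \nabla f(x_n) - x^*\|^2 \tag{12.12}$$
$$\le d_n + \alpha_n (x^* - x_n)^T \nabla f(x_n) + \frac{1}{2} \alpha_n^2 K. \tag{12.13}$$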


The inequality in (12.12) comes from the fact that projection on a convex set is
non-expansive. That is,

$$\|[x]_{\mathcal{C}} - [y]_{\mathcal{C}}\| \le \|x - y\|.$$
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Fig. 12.5 Projection on a 
convex set is non-expansive 


Fig. 12.6 The inequality (12.14)


This property is clear from a picture (see Fig. 12.5) and is not difficult to prove.
Observe that $\alpha_n \to 0$, because $\sum_n \alpha_n^2 < \infty$. Hence, (12.13) and (12.7) imply
(12.10).
It remains to show (12.9). As Fig. 12.6 shows, the convexity of $f(\cdot)$ implies that

$$(x^* - x)^T \nabla f(x) \le f(x^*) - f(x). \tag{12.14}$$

Also, if $d_n \ge \epsilon$, one has $f(x^*) - f(x_n) \le -\delta(\epsilon)$, for some $\delta(\epsilon) > 0$. Thus,
whenever $d_n \ge \epsilon$, one has

$$(x^* - x_n)^T \nabla f(x_n) \le -\delta(\epsilon).$$

Together with (12.11), this implies

$$d_{n+1} \le d_n - \alpha_n \delta(\epsilon) + \frac{1}{2} \alpha_n^2 K.$$

Now, let

$$\gamma_n = \alpha_n \delta(\epsilon) - \frac{1}{2} \alpha_n^2 K. \tag{12.15}$$

Since $\alpha_n \to 0$, there is some $n_2(\epsilon)$ such that $\gamma_n > 0$ for $n \ge n_2(\epsilon)$. Moreover,
(12.8) is seen to imply that $\sum_n \gamma_n = \infty$. This proves (12.9) after replacing $n_0(\epsilon)$ by
$\max\{n_0(\epsilon), n_2(\epsilon)\}$. □


12.2.2 Stochastic Gradient Projection 


There are many situations where one cannot measure directly the gradient $\nabla f(x_n)$
of the function. Instead, one has access to a random estimate of that gradient,
$\nabla f(x_n) + \eta_n$, where $\eta_n$ is a random variable. One hopes that, if the error $\eta_n$ is small
enough, GP still converges to $x^*$ when one uses $\nabla f(x_n) + \eta_n$ instead of $\nabla f(x_n)$.
The point of this section is to justify this hope.

The algorithm is as follows (see Fig. 12.7): 


$$x_{n+1} = [x_n - \alpha_n g_n]_{\mathcal{C}}, \tag{12.16}$$

where

$$g_n = \nabla f(x_n) + z_n + b_n \tag{12.17}$$

is a noisy estimate of the gradient. In (12.17), $z_n$ is a zero-mean random variable that
models the estimation noise and $b_n$ is a constant that models the estimation bias.

As a simple example, let $f(x) = 6(x - 0.2)^2$ for $x \in \mathcal{C} := [0, 1]$. With $\alpha_n =
1/n$, $b_n = 0$, and $x_0 = 0$, the algorithm is

$$x_{n+1} = \left[ x_n - \frac{12}{n}(x_n - 0.2 + z_n) \right]_{\mathcal{C}}. \tag{12.18}$$

In this expression, the $z_n$ are i.i.d. $U[-0.5, 0.5]$. Figure 12.8 shows the values that
the algorithm produces.
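
A sketch of the iteration (12.18). The function name `stochastic_gradient_projection`, the seed, and the number of steps are illustrative assumptions; as in the deterministic case, the first update is taken with $n = 1$ so that $\alpha_n = 1/n$ is defined.

```python
import random

def stochastic_gradient_projection(steps, seed=0):
    """Iterate (12.18): noisy gradient step followed by projection on [0, 1]."""
    rng = random.Random(seed)
    x = 0.0
    xs = [x]
    for n in range(1, steps + 1):
        z = rng.uniform(-0.5, 0.5)              # estimation noise z_n
        y = x - (12.0 / n) * (x - 0.2 + z)      # noisy gradient step
        x = max(0.0, min(1.0, y))               # projection onto C = [0, 1]
        xs.append(x)
    return xs

xs = stochastic_gradient_projection(20000)
```

Because the step sizes decrease like $1/n$, the noise is averaged out only slowly: the iterates wander around 0.2 with a spread that shrinks roughly like $1/\sqrt{n}$.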


Fig. 12.7 The figure shows level curves of $f(\cdot)$ and the convex set $\mathcal{C}$. It also shows
the first few iterations of GPA in red and of SGPA in blue


Fig. 12.8 The stochastic gradient projection algorithm (12.18)
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This algorithm converges to the minimum x* = 0.2 of the function, albeit slowly. 
For the algorithm (12.16) and (12.17) to converge, one needs the estimation noise
$z_n$ and bias $b_n$ to be small. Specifically, one has the following result.

Theorem 12.2 Assume that $\mathcal{C}$ is bounded and

$$f(\cdot) \text{ has a unique minimizer } x^* \text{ in } \mathcal{C}; \tag{12.19}$$
$$\|\nabla f(x)\|^2 \le K, \quad \forall x \in \mathcal{C}; \tag{12.20}$$
$$\alpha_n > 0, \quad \sum_n \alpha_n = \infty, \quad \sum_n \alpha_n^2 < \infty. \tag{12.21}$$

In addition, assume that

$$\sum_{n=0}^{\infty} \alpha_n \|b_n\| < \infty; \tag{12.22}$$
$$E[z_{n+1} \mid z_0, z_1, \ldots, z_n] = 0; \tag{12.23}$$
$$E(\|z_n\|^2) \le A, \quad n \ge 0. \tag{12.24}$$

Then $x_n \to x^*$ with probability one.


Proof The proof is essentially the same as for the deterministic case.
The inequality (12.11) becomes

$$d_{n+1} \le d_n + \alpha_n (x^* - x_n)^T [\nabla f(x_n) + z_n + b_n] + \frac{1}{2} \alpha_n^2 K. \tag{12.25}$$

Accordingly, $\gamma_n$ in (12.15) is replaced by

$$\gamma_n = \alpha_n \left[ \delta(\epsilon) + (x_n - x^*)^T (z_n + b_n) \right] - \frac{1}{2} \alpha_n^2 K. \tag{12.26}$$

Now, (12.23) implies that $v_n := \sum_{m=0}^{n} \alpha_m z_m$ is a martingale.² Because of (12.24)
and (12.21), one has $E(\|v_n\|^2) \le A \sum_{m=0}^{\infty} \alpha_m^2 < \infty$ for all $n$. This implies, by
the Martingale Convergence Theorem 12.3, that $v_n$ converges to a finite random
variable. Combining this fact with (12.22) shows that³ $\sum_{m \ge n} \alpha_m [z_m + b_m] \to
0$. Since $\|x_n - x^*\|$ is bounded, this implies that the effect of the estimation
error is asymptotically negligible and that the argument used in the proof of GP
applies here. □


12.2.3 Martingale Convergence 


We discuss the theory of martingales in Sect. 15.9. Here are the ideas we needed in 
the proof of Theorem 12.2. 

Let $\{x_n, y_n, n \ge 0\}$ be random variables such that $E(x_n)$ is well-defined for all
$n$. The sequence $x_n$ is said to be a martingale with respect to $\{(x_m, y_m), m \ge 0\}$ if

$$E[x_{n+1} \mid x_m, y_m, m \le n] = x_n, \quad \forall n.$$

Theorem 12.3 (Martingale Convergence Theorem) If a martingale $x_n$ is such
that $E(x_n^2) \le B < \infty$ for all $n$, then it converges with probability one to a finite
random variable.


For a proof, see Theorem 15.13. 


12.3 Big Data 


The web makes it easy to collect a vast amount of data from many sources. Examples
include books, movies, and restaurants that people like, websites that they visit, their
mobility patterns, their medical history, and measurements from sensors. This data


² See the next section.


³ Recall that if a series $\sum_n w_n$ converges, then the tail $\sum_{m \ge n} w_m$ of the series converges to zero as
$n \to \infty$.
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Fig. 12.9 The web provides 
access to vast amounts of 
data. How does one extract 
useful knowledge from that 
data? 


can be useful to recommend items that people will probably like, treatments that 
are likely to be effective, people you might want to commute with, to discover 
who talks to who, efficient management techniques, and so on. Moreover, new 
technologies for storage, databases, and cloud computing make it possible to process 
huge amounts of data. This section explains a few of the formulations of such 
problems and algorithms to solve them (Fig. 12.9). 


12.3.1 Relevant Data 


Many factors potentially affect an outcome, but what are the most relevant ones? 
For instance, the success in college of a student is correlated with her high-school 
GPA, her scores in advanced placement courses and standardized tests. How does 
one discover the factors that best predict her success? A similar situation occurs for 
predicting the odds of getting a particular disease, the likelihood of success of a 
medical treatment, and many other applications. 

Identifying these important factors can be most useful to improve outcomes. For 
instance, if one discovers that the odds of success in college are most affected by the 
number of books that a student has to read in high-school and by the number of hours 
she spends playing computer games, then one may be able to suggest strategies for 
improving the odds of success. 

One formulation of the problem is that the outcome $Y$ is correlated with a
collection of factors that we represent by a vector $X$ with $N \gg 1$ components.
For instance, if $Y$ is the GPA after 4 years in college, the first component $X_1$ of
$X$ might indicate the high-school GPA, the second component $X_2$ the score on a
specific standardized test, $X_3$ the number of books the student had to write reports
on, and so on. Intuition suggests that, although $N \gg 1$, only relatively few of the
components of $X$ really affect the outcome $Y$ in a significant way. However, we do
not want to presume that we know what these components are.

Say that you want to predict Y on the basis of six components of X. Which 
ones should you consider? This problem turns out to be hard because there are 
many (about $N^6/6!$) subsets with 6 elements in M = {1, 2, ..., N}, and this
combinatorial aspect of the problem makes it intractable when N is large. To 
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Fig. 12.10 


make progress, we change the formulation slightly and resort to some heuristic 
(Fig. 12.10). 
The change in formulation is to consider the problem of minimizing

$$J(b) = E\left( \left( Y - \sum_n b_n X_n \right)^2 \right)$$

over $b = (b_1, \ldots, b_N)$, subject to a bound on

$$C(b) = \sum_n |b_n|.$$

This is called the LASSO problem, for "least absolute shrinkage and selection 
operator." Thus, the hard constraint on the number of components is replaced by a 
cost for using large coefficients. Intuitively, the problem is still qualitatively similar. 
Also, the constraint is such that the solution of the problem has many b, equal to 
zero. Intuitively, if a component is less useful than others, its coefficient is probably 
equal to zero in the solution. 

One interpretation of this problem is as follows. In order to simplify the algebra,
we assume that $Y$ and $X$ are zero-mean. Assume that

$$Y = \sum_n B_n X_n + Z,$$

where $Z$ is $\mathcal{N}(0, \sigma^2)$ and the coefficients $B_n$ are random and independent with a
prior distribution of $B_n$ given by

$$f_n(b) = \frac{\lambda}{2} \exp\{-\lambda |b|\}.$$

Then

⁴ If you cannot crack a nut, look for another one. (A difference between Engineering and
Mathematics?)




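$$\mathrm{MAP}[B \mid X = x, Y = y] = \arg\max_b f_{B|X,Y}[b \mid x, y]$$
$$= \arg\max_b f_B(b) f_{Y|X,B}[y \mid x, b]$$
$$= \arg\max_b \exp\left\{ -\frac{1}{2\sigma^2} \left( y - \sum_n b_n x_n \right)^2 \right\} \times \exp\left\{ -\lambda \sum_n |b_n| \right\}$$
$$= \arg\min_b \left\{ \left( y - \sum_n b_n x_n \right)^2 + \mu \sum_n |b_n| \right\}$$

with $\mu = 2\lambda\sigma^2$. This formulation is the Lagrange multiplier formulation of the
LASSO problem where the constraint on the cost $C(b)$ is replaced by a penalty
$\mu C(b)$. Thus, the LASSO problem is equivalent to finding $\mathrm{MAP}[B \mid X, Y]$ under
the assumptions stated above.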

We explain a greedy algorithm that selects the components one by one, trying to
maximize the progress that it makes with each selection. First assume that we can
choose only one component $X_n$ among the $N$ elements in $X$. We know that

$$L[Y \mid X_n] = \frac{\mathrm{cov}(Y, X_n)}{\mathrm{var}(X_n)} X_n =: b_n X_n$$

and

$$E((Y - L[Y \mid X_n])^2) = \mathrm{var}(Y) - \frac{\mathrm{cov}(Y, X_n)^2}{\mathrm{var}(X_n)} = \mathrm{var}(Y) - |\mathrm{cov}(Y, X_n)| \times |b_n|.$$

Thus, one unit of “cost” $C(b_n) = |b_n|$ invested in $b_n$ brings a reduction $|\mathrm{cov}(Y, X_n)|$
in the objective $J(b_n)$. It then makes sense to choose the first component with the
largest value of “reward per unit cost” $|\mathrm{cov}(Y, X_n)|$. Say that this component is $X_1$
and let $\hat{Y}_1 = L[Y \mid X_1]$.

Second, assume that we stick to our choice of $X_1$ with coefficient $b_1$ and that we
look for a second component $X_n$ with $n \ne 1$ to add to our estimate. Note that

$$E((Y - b_1 X_1 - b_n X_n)^2) = E((Y - b_1 X_1)^2) - 2 b_n \mathrm{cov}(Y - b_1 X_1, X_n) + b_n^2 \mathrm{var}(X_n).$$

This expression is minimized over $b_n$ by choosing

$$b_n = \frac{\mathrm{cov}(Y - b_1 X_1, X_n)}{\mathrm{var}(X_n)}$$

and it is then equal to

$$E((Y - b_1 X_1)^2) - \frac{\mathrm{cov}(Y - b_1 X_1, X_n)^2}{\mathrm{var}(X_n)}.$$

Thus, as before, one unit of additional cost in $C(b_1, b_n)$ invested in $b_n$ brings a
reduction

$$|\mathrm{cov}(Y - b_1 X_1, X_n)|$$

in the cost $J(b_1, b_n)$. This suggests that the second component $X_n$ to pick should be
the one with the largest covariance with $Y - b_1 X_1$.

These observations suggest the following algorithm, called the stepwise regression
algorithm. At each step $k$, the algorithm finds the component $X_n$ that is
most correlated with the residual error $Y - \hat{Y}_k$, where $\hat{Y}_k$ is the current estimate.
Specifically, the algorithm is as follows:


Step 0: $\hat{Y}_0 = E(Y)$ and $S_0 = \emptyset$;
Step k+1: Find $n \notin S_k$ that maximizes $E((Y - \hat{Y}_k) X_n)$;
Let $S_{k+1} = S_k \cup \{n\}$, $\hat{Y}_{k+1} = L[Y \mid X_n, n \in S_{k+1}]$, $k = k + 1$;
Repeat until $E((Y - \hat{Y}_k)^2) \le \epsilon$.
In practice, one is given a collection of outcomes $\{Y^m, m = 1, \ldots, M\}$ with
factors $X^m = (X_1^m, X_2^m, \ldots, X_N^m)$. Here, each $m$ corresponds to one sample, say one
student in the college success example. From those samples, one can estimate the
mean values by the sample means. Thus, in step $k$, one has calculated coefficients
$(b_1, \ldots, b_k)$ to calculate

$$\hat{Y}_k^m = b_1 X_1^m + \cdots + b_k X_k^m.$$

One then estimates $E((Y - \hat{Y}_k) X_n)$ by

$$\frac{1}{M} \sum_{m=1}^{M} (Y^m - \hat{Y}_k^m) X_n^m.$$

Also, one approximates $L[Y \mid X_n, n \in S_{k+1}]$ by the linear regression.

It is useful to note that, by the Law of Large Numbers, the number M of samples 
needed to estimate the means and covariances is not necessarily very large. Thus, 
although one may have data about millions of students, a reasonable estimate may 
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be obtained from a few thousand. Recall that one can use the sample moments to 
compute confidence intervals for these estimates. 

Signal processing uses a similar algorithm called matching pursuit introduced in
Mallat and Zhang (1993). In that context, the problem is to find a compact
representation of a signal, such as a picture or a sound. One considers a representation of the
signal as a linear combination of basis functions. The matching pursuit algorithm
finds the most important basis functions to use in the representation.


An Example 
Our example is very small, so that we can understand the steps. We assume that all
the random variables are zero-mean and that $N = 3$ with covariance matrix

$$\Sigma_Z = \begin{bmatrix} 4 & 3 & 2 & 2 \\ 3 & 4 & 2 & 2 \\ 2 & 2 & 4 & 1 \\ 2 & 2 & 1 & 4 \end{bmatrix},$$

where $Z' = (Y, X_1, X_2, X_3) = (Y, X')$.


We first try the stepwise regression. The component $X_n$ most correlated with $Y$
is $X_1$. Thus,

$$\hat{Y}_1 = L[Y \mid X_1] = \frac{\mathrm{cov}(Y, X_1)}{\mathrm{var}(X_1)} X_1 = \frac{3}{4} X_1 =: b_1 X_1.$$

The next step is to compute the correlations $E(X_n(Y - \hat{Y}_1))$ for $n = 2, 3$. We find

$$E(X_2(Y - \hat{Y}_1)) = E(X_2(Y - b_1 X_1)) = 2 - 2b_1 = 0.5$$
$$E(X_3(Y - \hat{Y}_1)) = E(X_3(Y - b_1 X_1)) = 2 - 2b_1 = 0.5.$$

Hence, the algorithm selects $X_2$ as the next component and one finds

$$\hat{Y}_2 = L[Y \mid X_1, X_2] = \begin{bmatrix} 3 & 2 \end{bmatrix} \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}^{-1} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \frac{2}{3} X_1 + \frac{1}{6} X_2.$$

The resulting error variance is

$$E((Y - \hat{Y}_2)^2) = 4 - \begin{bmatrix} 3 & 2 \end{bmatrix} \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}^{-1} \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \frac{5}{3}.$$


Fig. 12.11 A complex looking signal that is the sum of three sine waves

12.3.2 Compressed Sensing 


Complex looking objects may have a simple hidden structure. For example, the
signal $s(t)$ shown in Fig. 12.11 is the sum of three sine waves. That is,

$$s(t) = \sum_{i=1}^{3} b_i \sin(2\pi\phi_i t), \quad t \ge 0. \tag{12.27}$$

A classical result, called the Nyquist sampling theorem, states that one can
reconstruct a signal exactly from its values measured every $T$ seconds, provided that
$1/T$ is at least twice the largest frequency in the signal. According to that result, we
could reconstruct $s(t)$ by specifying its value every $T$ seconds if $T < 1/(2\phi_i)$ for
$i = 1, 2, 3$. However, in the case of (12.27), one can describe $s(t)$ completely by
specifying the values of the six parameters $\{b_i, \phi_i, i = 1, 2, 3\}$. Also, it seems clear
in this particular case that one does not need to know many sample values $s(t_k)$
for different times $t_k$ to be able to reconstruct the six parameters and therefore the
signal $s(t)$ for all $t \ge 0$. Moreover, one expects the reconstruction to be unique if
we choose a few sampling times $t_k$ randomly. The same is true if the representation
is in terms of different functions, such as polynomials or wavelets.

This example suggests that if a signal has a simple representation in terms of 
some basis functions (e.g., sine waves), then it is possible to reconstruct it exactly 
from a small number of samples. 

Computing the parameters of (12.27) from a number of samples $s(t_k)$ is highly
nontrivial, so that the fact that it is possible does not seem very useful. However, a
slightly different perspective shows that the problem can be solved. Assume that we
have a collection of functions (Fig. 12.12)

$$g_n(t) = \sin(2\pi f_n t), \quad t \ge 0, \ n = 1, \ldots, N.$$
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Fig. 12.12 A tough nut to 
crack! 


Assume also that the frequencies $\{\phi_1, \phi_2, \phi_3\}$ in $s(t)$ are in the collection $\{f_n, n =
1, \ldots, N\}$. We can then try to find the vector $a = \{a_n, n = 1, \ldots, N\}$ such that

$$s(t_k) = \sum_{n=1}^{N} a_n g_n(t_k), \quad \text{for } k = 1, \ldots, K.$$

We should be able to do this with three functions, by choosing the appropriate
coefficients. How do we do this systematically? A first idea is to formulate the
following problem:

$$\text{Minimize } \sum_n 1\{a_n \ne 0\}$$
$$\text{such that } s(t_k) = \sum_n a_n g_n(t_k), \quad \text{for } k = 1, \ldots, K.$$

That is, one tries to find the most economical representation of s(t) as a linear 
combination of functions in the collection. 

Unfortunately, this problem is intractable because of the number of choices of
sets of nonzero coefficients $a_n$, a difficulty we already faced in the previous section.
The key trick is, as before, to convert the problem into a much easier one that retains
the main goal.

The new problem is as follows:

$$\text{Minimize } \sum_n |a_n|$$

$$\text{such that } s(t_k) = \sum_n a_n g_n(t_k), \quad \text{for } k = 1, \ldots, K. \tag{12.28}$$

Trying to minimize the sum of the absolute values of the coefficients $a_n$ is a
relaxation of limiting the number of nonzero coefficients. (Simple examples show
that choosing $\sum_n |a_n|^2$ instead of $\sum_n |a_n|$ often leads to bad reconstructions.) The
result is that if $K$ is large enough, then the solution is exact with a high probability.
Theorem 12.4 (Exact Recovery from Random Samples) The signal $s(t)$ can be
recovered exactly with a very high probability from $K$ samples by solving (12.28) if

$$K \geq C \times B \times \log(N).$$

In this expression, $C$ is a small constant, $B$ is the number of sine waves that make
up $s(t)$, and $N$ is the number of sine waves in the collection.


Note that this is a probabilistic statement. Indeed, one could be unlucky and
choose sampling times $t_k$ where $s(t_k) = 0$ (see Fig. 12.11) and these samples
would not enable the reconstruction of $s(t)$. More generally, the samples could be
chosen so that they do not enable an exact reconstruction. The theorem says that the
probability of poor samples is very small.

Thus, in our example, where $B = 3$, one can expect to recover the signal $s(t)$
exactly from about $3 \log(100) \approx 14$ samples if $N \leq 100$.

Problem (12.28) is equivalent to the following linear programming problem,
which implies that it is easy to solve:

$$\text{Minimize } \sum_n b_n$$

$$\text{such that } s(t_k) = \sum_n a_n g_n(t_k), \quad \text{for } k = 1, \ldots, K,$$

$$\text{and } -b_n \leq a_n \leq b_n, \quad \text{for } n = 1, \ldots, N. \tag{12.29}$$


Assume that

$$s(t) = \sin(2\pi t) + 2\sin(2.4\pi t) + 3\sin(3.2\pi t), \quad t \in [0, 1]. \tag{12.30}$$

The frequencies in $s(t)$ are $\phi_1 = 1$, $\phi_2 = 1.2$, and $\phi_3 = 1.6$. The collection of
functions is

$$\{g_n(t) = \sin(2\pi f_n t), \; n = 1, \ldots, 100\},$$

where $f_n = n/10$. The frequencies of the sine waves in the collection are 0.1, 0.2, ..., 10. Thus,
the frequencies in $s(t)$ are contained in the collection, so that perfect reconstruction
is possible as

$$s(t) = \sum_n a_n g_n(t)$$

with $a_{10} = 1$, $a_{12} = 2$, and $a_{16} = 3$, and all the other coefficients $a_n$ equal to zero.
The theory tells us that reconstruction should be possible with about 14 samples. We
choose 15 sampling times $t_k$ randomly and uniformly in [0, 1]. We then ask Python
to solve (12.29). The solution is shown in Fig. 12.13.
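The data that (12.29) needs can be assembled in a few lines. The following is only a sketch under stated assumptions: it builds the values $g_n(t_k)$ for the example (12.30), checks that the sparse vector with $a_{10} = 1$, $a_{12} = 2$, $a_{16} = 3$ satisfies the $K$ equality constraints, and evaluates $B \log(N)$ taking $C = 1$. The names `sample_matrix`, `a_true`, and the random seed are illustrative choices, not taken from the book. The linear program itself would be passed to an LP solver, which is not shown here.

```python
import math
import random

N = 100                                    # number of sine waves in the collection
freqs = [n / 10 for n in range(1, N + 1)]  # f_n = n/10

def g(n, t):
    """Basis function g_n(t) = sin(2*pi*f_n*t), with n = 1, ..., N."""
    return math.sin(2 * math.pi * freqs[n - 1] * t)

def s(t):
    """Signal (12.30)."""
    return (math.sin(2 * math.pi * t)
            + 2 * math.sin(2.4 * math.pi * t)
            + 3 * math.sin(3.2 * math.pi * t))

def sample_matrix(times):
    """Rows are indexed by the sampling times t_k, columns by the functions g_n."""
    return [[g(n, t) for n in range(1, N + 1)] for t in times]

rng = random.Random(0)                     # fixed seed, chosen arbitrarily
K = 15
times = [rng.random() for _ in range(K)]   # t_k uniform in [0, 1)
G = sample_matrix(times)

# Sparse coefficient vector of the example: a_10 = 1, a_12 = 2, a_16 = 3.
a_true = [0.0] * N
a_true[10 - 1], a_true[12 - 1], a_true[16 - 1] = 1.0, 2.0, 3.0

# Largest violation of the constraints s(t_k) = sum_n a_n g_n(t_k).
residual = max(abs(s(t) - sum(G[k][n] * a_true[n] for n in range(N)))
               for k, t in enumerate(times))
l1_norm = sum(abs(a) for a in a_true)      # objective of (12.28) at a_true
samples_needed = 3 * math.log(N)           # B * log(N), with C = 1 assumed
```

Since `a_true` is feasible, the optimal value of (12.28) for these samples cannot exceed `l1_norm`, which is 6 here. Exact recovery means that the solver returns `a_true` itself.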


Another Example 

Figure 12.14, from Candes and Romberg (2007), shows another example. The image
on top has about one million pixels. However, it can be represented as a linear
combination of 25,000 functions called wavelets. Thus, the compressed sensing
results tell us that one should be able to reconstruct the picture exactly from a small
multiple of 25,000 randomly chosen pixels. It turns out that this is indeed the case
with about 96,000 pixels.

Fig. 12.13 Exact reconstruction of the signal (12.30) with 15 samples chosen uniformly in [0, 1]. The signal is in green and the reconstruction in blue

Fig. 12.14 Original image with $10^6$ pixels (top) and reconstruction from 96,000 randomly chosen pixels (bottom)


12.3.3 Recommendation Systems 


Which movie would you like to watch? One formulation of the problem is as
follows. There is a $K \times N$ matrix $Y$. The entry $Y(k, n)$ of the matrix indicates
how much user $k$ likes movie $n$. However, one does not get to observe the complete
matrix. Instead, one observes a number of entries, when users actually watch movies 
and one gets to record their rankings. The problem is to complete the matrix to be 
able to recommend movies to users. 

This matrix completion is based on the idea that the entries of the matrix are 
not independent. For instance, assume that Bob and Alice have seen the same five 
movies and gave them the same ranking. Assume that Bob has seen another movie 
he loved. Chances are that Alice would also like it. 

To formulate this dependency of the entries of the matrix Y, one observes that 
even though there are thousands of movies, a few factors govern how much users 
like them. Thus, it is reasonable to expect that many columns of the matrix are 
combinations of a few common vectors that correspond to the hidden factors that 
influence the rankings by users. Thus, a few independent vectors get combined 
into linear combinations that form the columns. Consequently the matrix Y has a 
small number of linearly independent columns, i.e., it is a low rank matrix. This 
observation leads to the question of whether one can recover a low rank matrix Y 
from observed entries? 

One possible formulation is

$$\text{Minimize } \operatorname{rank}(X) \text{ s.t. } X(k, n) = M(k, n), \; \forall (k, n) \in \Omega.$$

Here, $\{M(k, n), (k, n) \in \Omega\}$ is the set of observed entries of the matrix. Thus, one
wishes to find the lowest-rank matrix $X$ that is consistent with the observed entries.

As before, such a problem is hard. To simplify the problem, one replaces the rank
by the nuclear norm $\|X\|_*$, where

$$\|X\|_* = \sum_i \sigma_i$$

and the $\sigma_i$ are the singular values of the matrix $X$. The rank of the matrix counts
the number of nonzero singular values. The nuclear norm is a convex function of
the entries of the matrix, which makes the problem a convex programming problem
that is easy to solve. Remarkably, as in the case of compressed sensing, the solution
of the modified problem is very good.


5 The rank of a matrix is the maximum number of linearly independent columns.
Theorem 12.5 (Exact Matrix Completion from Random Entries) The solution
of the problem


$$\text{Minimize } \|X\|_* \text{ s.t. } X(k, n) = M(k, n), \; \forall (k, n) \in \Omega$$


is the matrix Y with a very high probability if the observed entries are chosen 
uniformly at random and if there are at least 


Cn! log(n) 


observations. In this expression, C is a small constant, n = max{K, N}, and r is 
the rank of Y. 


This result is useful in many situations where this number of required observations
is much smaller than $K \times N$, which is the number of entries of $Y$. The reference
contains many extensions of these results and details on numerical solutions.


12.4 Deep Neural Networks 


Deep neural networks (DNN) are electronic processing circuits inspired by the 
structure of the brain. For instance, our vision system consists of layers. The first 
layer is in the retina that captures the intensity and color of zones in our field of 
vision. The next layer extracts edges and motion. The brain receives these signals 
and extracts higher level features. A simplistic model of this processing is that the 
neurons are arranged in successive layers, where each neuron in one layer gets 
inputs from neurons in the previous layer through connections called synapses. 
Presumably, the weights of these connections get tuned as we grow up and learn
to perform tasks, possibly by trial and error. Figure 12.15 sketches a DNN. The
inputs at the left of the DNN are the features $X$ from which the system produces
the probability that $X$ corresponds to a dog, or the estimate of some quantity.


Fig. 12.15 A neural network

Fig. 12.16 The logistic function


Each circle is a circuit that we call a neuron. In the figure, $z_k$ is the output of
neuron $k$. It is multiplied by $\theta_k$ to contribute the quantity $\theta_k z_k$ to the total input $V_l$ of
neuron $l$. The parameter $\theta_k$ represents the strength of the connection between neuron
$k$ and neuron $l$. Thus, $V_l = \sum_n \theta_n z_n$, where the sum is over all the neurons $n$ of the
layer to the immediate left of neuron $l$, including neuron $k$. The output $z_l$ of neuron
$l$ is equal to $f(a_l, V_l)$, where $a_l$ is a parameter specific to that neuron and $f$ is some
function that we discuss later.

With this structure, it is easy to compute the derivative of some output $Z$ with
respect to some weight, say $\theta_k$. We do it in the last section of this chapter.

What should be the functions $f(a, V)$? Inspired by the idea that a neuron fires if
it is excited enough, one may use a function $f(a, V)$ that is close to 1 if $V > a$ and
close to 0 if $V < a$. To make the function differentiable, one may use $f(a, V) =
g(V - a)$ with

$$g(v) = \frac{1}{1 + e^{-\beta v}},$$

where $\beta$ is a positive constant. If $\beta$ is large, then $e^{-\beta v}$ goes from a very large to a
very small value when $v$ goes from negative to positive. Consequently, $g(v)$ goes
from 0 to 1 (Fig. 12.16).
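A neuron of this kind takes only a few lines of Python. This is a minimal sketch, assuming the logistic form $g(v) = 1/(1 + e^{-\beta v})$, which takes values between 0 and 1. The function names and the default $\beta = 10$ are illustrative choices.

```python
import math

def g(v, beta=10.0):
    """Logistic function g(v) = 1 / (1 + exp(-beta * v))."""
    # Use the algebraically equivalent form for negative arguments so that
    # exp() is never called with a large positive argument.
    x = beta * v
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def neuron_output(a, thetas, zs, beta=10.0):
    """Output f(a, V) = g(V - a) of a neuron with threshold a, where
    V = sum_n theta_n * z_n is the weighted sum of its inputs z_n."""
    V = sum(theta * z for theta, z in zip(thetas, zs))
    return g(V - a, beta)
```

For instance, `neuron_output(0.5, [1.0, 1.0], [1.0, 1.0])` is very close to 1 because the total input $V = 2$ exceeds the threshold 0.5, whereas a total input below the threshold gives a value close to 0.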

The DNN is able to model many functions by adjusting its parameters. To see
why, consider neuron $l$. The output of this neuron indicates whether the linear
combination $V_l = \sum_n \theta_n z_n$ is larger or smaller than the threshold $a_l$ of the
neuron. Consequently, the first layer divides the set of inputs into regions separated
by hyperplanes. The next layer then further divides these regions. The number of
regions that can be obtained by this process is exponential in the number of layers.
The final layer then assigns values to the regions, thus approximating a complex
function of the input vector by an almost piecewise constant function.

The missing piece of the puzzle is that, unfortunately, the cost function is not a 
nice convex function of the parameters of the DNN. Instead, it typically has many 
local minima. Consequently, by using the SGD algorithm, the tuning of the DNN 
may get stuck in a local minimum. Also, to reduce the number of parameters to 
tune, one usually selects a few layers with fixed parameters, such as edge detectors 
in vision systems. Thus, the selection of the DNN becomes somewhat of an art, like 
cooking. 

Thus, it remains impossible to predict whether the DNN will be a good technique 
for machine learning in a specific application. The answer of the practitioners is to 
try and see. If it works, they publish a paper. We are far from the proven convergence 
results of adaptive systems. Ah, nostalgia. ... 

There is a worrisome aspect to these black-box approaches. When the DNN
has been tuned and seems to perform well on many trials, not only does one
not understand what it really does, but one has no guarantee that it will not
seriously misbehave for some inputs. Imagine then a killer drone with a DNN target
recognition system. ... It is not surprising that a number of serious scientists have
raised concerns about "artificial stupidity" and the need to build safeguards into such
systems. "Open the pod bay doors, HAL."


12.4.1 Calculating Derivatives 


Let's compute the derivative of $Z$ with respect to $\theta_k$.

Say you increase $\theta_k$ by $\epsilon$. This increases $V_l$ by $\epsilon z_k$. In turn, this increases
$z_l$ by $\delta z_l := \epsilon z_k f'(a_l, V_l)$, where $f'(a_l, V_l)$ is the derivative of $f(a_l, V_l)$ with
respect to $V_l$. Consequently, this increases $V_m$ by $\theta_l \delta z_l$. The result is an increase
of $z_m$ by $\delta z_m = \theta_l \delta z_l f'(a_m, V_m)$. Finally, this increases $V_r$ by $\theta_m \delta z_m$ and $Z$ by
$\theta_m \delta z_m f'(a_r, V_r)$. We conclude that

$$\frac{dZ}{d\theta_k} = z_k f'(a_l, V_l) \, \theta_l f'(a_m, V_m) \, \theta_m f'(a_r, V_r).$$


The details do not matter too much. The point is that the structure of the network 
makes the calculation of the derivatives straightforward. 
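The product formula can be checked numerically on a small chain of neurons. The sketch below makes several assumptions for illustration: a single path $k \to l \to m \to r$ in which each neuron has one input, the logistic form $f(a, V) = g(V - a)$ with $\beta = 2$, and arbitrary parameter values. It compares the product of local derivatives with a central finite difference.

```python
import math

BETA = 2.0  # steepness of the logistic function (value assumed)

def g(v):
    return 1.0 / (1.0 + math.exp(-BETA * v))

def f(a, V):
    """Neuron output f(a, V) = g(V - a)."""
    return g(V - a)

def f_prime(a, V):
    """Derivative of f(a, V) with respect to V, equal to beta * g * (1 - g)."""
    y = g(V - a)
    return BETA * y * (1.0 - y)

def forward(z_k, theta_k, theta_l, theta_m, a_l, a_m, a_r):
    """Chain k -> l -> m -> r; returns the output Z and the total inputs."""
    V_l = theta_k * z_k
    z_l = f(a_l, V_l)
    V_m = theta_l * z_l
    z_m = f(a_m, V_m)
    V_r = theta_m * z_m
    Z = f(a_r, V_r)
    return Z, (V_l, V_m, V_r)

def dZ_dtheta_k(z_k, theta_k, theta_l, theta_m, a_l, a_m, a_r):
    """Product of the local derivatives along the path from theta_k to Z."""
    _, (V_l, V_m, V_r) = forward(z_k, theta_k, theta_l, theta_m, a_l, a_m, a_r)
    return (z_k * f_prime(a_l, V_l)
            * theta_l * f_prime(a_m, V_m)
            * theta_m * f_prime(a_r, V_r))

params = dict(z_k=0.7, theta_k=0.5, theta_l=-1.2, theta_m=0.8,
              a_l=0.1, a_m=-0.3, a_r=0.2)
analytic = dZ_dtheta_k(**params)

# Central finite difference in theta_k as an independent check.
eps = 1e-6
up = dict(params, theta_k=params["theta_k"] + eps)
down = dict(params, theta_k=params["theta_k"] - eps)
numeric = (forward(**up)[0] - forward(**down)[0]) / (2 * eps)
```

The two values `analytic` and `numeric` agree up to the finite-difference error. In a network where several paths lead from a weight to the output, the contributions of the paths would be added.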
12.5 Summary


* Online Linear Regression; 

* Convex Sets and Functions; 

* Gradient Projection Algorithm; 

* Stochastic Gradient Projection Algorithm; 

* Deep Neural Networks; 

* Martingale Convergence Theorem; 

* Big Data: Relevant Data, Compressed Sensing, Recommendation Systems. 


12.5.1 Key Equations and Formulas 


Convex Set if it contains its chords (12.1) 
Convex Function if it is above its tangents (12.2) 
Convergence of GP if unique minimizer and bounded gradient T.12.1 
Convergence of SGP if bounded drift and noise variance T.12.2 
Martingale CT $L^1$- or $L^2$-bounded MG converges w.p. 1 T.12.3


12.6 References 


Online linear regression algorithms are discussed in Strehl and Littman (2007). 
The book Bertsekas and Tsitsiklis (1989) is an excellent presentation of dis- 
tributed optimization algorithms. It explains the gradient projection algorithm and 
distributed implementations. The LASSO algorithm and many other methods are 
clearly explained in Hastie et al. (2009), together with applications. The theory of 
martingales is nicely presented by its father in Doob (1953). Theorem 12.4 is from 
Candes and Romberg (2007). 


12.7 Problems 


Problem 12.1 Let $\{Y_n, n \geq 1\}$ be i.i.d. $U[0, 1]$ random variables and $\{Z_n, n \geq 1\}$
be i.i.d. $\mathcal{N}(0, 1)$ random variables. Define $X_n = 1\{Y_n \geq a\} + Z_n$ for some constant
$a$. The goal of the problem is to design an algorithm that "learns" the value of $a$
from the observation of pairs $(X_n, Y_n)$. We construct a model

$$X_n = g(Y_n - \theta),$$

where

$$g(u) = \frac{1}{1 + \exp\{-\lambda u\}} \tag{12.31}$$

with $\lambda = 10$. Note that when $u > 0$, the denominator of $g(u)$ is close to 1, so
that $g(u) \approx 1$. Also, when $u < 0$, the denominator is large and $g(u) \approx 0$. Thus,
$g(u) \approx 1\{u > 0\}$. The function $g(\cdot)$ is called the logistic function. Use SGD in
Python to estimate $\theta$ (Fig. 12.17).

Fig. 12.17 The logistic function (12.31) with $\lambda = 10$


Problem 12.2 Implement the stepwise regression algorithm with 


where $Z' = (Y, X_1, X_2, X_3) = (Y, X')$.

Problem 12.3 Implement the compressed sensing algorithm with

$$s(t) = 3\sin(2\pi t) + 2\sin(3\pi t) + 4\sin(4\pi t), \quad t \in [0, 1],$$

where you choose sampling times $t_k$ independently and uniformly in [0, 1]. Assume
that the collection of sine waves has the frequencies $\{0.1, 0.2, \ldots, 3\}$.
What is the minimum number of samples that you need for exact reconstruction?

Application: Choosing a fast route given uncertain delays, Controlling a 
Markov chain 
Topics: Stochastic Dynamic Programming, Markov Decision Problems 


13.1 Model 


One is given a finite connected directed graph. Each edge (i, j) is associated with a 
travel time T (i, j). The travel times are independent and have known distributions. 
There are a start node s and a destination node d. The goal is to choose a fast route 
from s to d. We consider a few different formulations (Fig. 13.1). 

To make the situation concrete, we consider the very simple example illustrated
in Fig. 13.2.

The goal is to choose the fastest path from s to d. In this example, the possible 
paths are sd, sad, and sabd. We assume that the delays T (i, j) on the edges (i, j) 
are as follows: 


$$T(s, a) =_D U[5, 13], \quad T(a, d) = 10, \quad T(a, b) =_D U[2, 10], \quad T(b, d) = 4, \quad T(s, d) = 20.$$

Thus, the delay from $s$ to $a$ is uniformly distributed in [5, 13], the delay from $a$ to
$d$ is equal to 10, and so on. The delays are assumed to be independent, which is an
unrealistic simplification.

Fig. 13.1 Road network. How to select a path?

Fig. 13.2 A simple graph


13.2 Formulation 1: Pre-planning 


In this formulation, one does not observe anything and one plans the journey ahead
of time. In this case, the solution is to look at the average travel times $E(T(i, j)) =
c(i, j)$ and to run a shortest path algorithm.

For our example, the average delays are $c(s, a) = 9$, $c(a, d) = 10$, and so on, as
shown in the top part of Fig. 13.3.

Let $V(i)$ be the minimum average travel time from node $i$ to the destination $d$.
The Bellman-Ford algorithm calculates these values as follows. Let $V_n(i)$ be an
estimate of the shortest average travel time from $i$ to $d$, as calculated after the $n$-th
iteration of the algorithm. The algorithm starts with $V_0(d) = 0$ and $V_0(i) = \infty$ for
$i \neq d$. Then, the algorithm calculates

$$V_{n+1}(i) = \min_j \{c(i, j) + V_n(j)\}, \quad n \geq 0. \tag{13.1}$$


The interpretation is that $V_n(i)$ is the minimum expected travel time from $i$ to $d$
over all paths that go through at most $n$ edges. The distance is infinite if no path
with at most $n$ edges reaches the destination $d$. This is exactly the same algorithm
we discussed in Sect. 11.2 to develop the Viterbi algorithm.

Fig. 13.3 The average a

These relations are justified by the fact that the mean value of a sum is the sum of
the mean values. For instance, say that the minimum average travel time from $a$ to
$d$ using a path that has at most 2 edges is $V_2(a, d)$ and it corresponds to a path with
random travel time $W_2(a, d)$. Then, the minimum average travel time from $s$ to $d$
using a path that has at most 3 edges follows either the direct path $sd$, which has travel
time $T(s, d)$, or the edge $sa$ followed by the fastest path from $a$ to $d$ that uses at most
2 edges with travel time $W_2(a, d)$. Accordingly, the minimum expected travel time
$V_3(s)$ from $s$ to $d$ using at most three edges is the minimum of $E(T(s, d)) = c(s, d)$
and the mean value of $T(s, a) + W_2(a, d)$. Thus,

$$V_3(s) = \min\{c(s, d), E(T(s, a) + W_2(a, d))\} = \min\{c(s, d), c(s, a) + V_2(a, d)\}.$$


Since the graph is finite, $V_n$ converges to $V$ in at most $N$ steps, where $N$ is the
length of the longest cycle-free path to node $d$. The limit is such that $V(i)$ is the
shortest average travel time from $i$ to $d$. Note that $V$ satisfies the following fixed-point
equations:

$$V(i) = \min_j \{c(i, j) + V(j)\}, \; \forall i, \text{ and } V(d) = 0. \tag{13.2}$$

These are called the dynamic programming equations (DPE). Thus, (13.1) is an
algorithm for solving (13.2).
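The iteration (13.1) is short enough to run on the example. The following sketch uses the mean delays of the example graph (9 for $U[5, 13]$, 6 for $U[2, 10]$) and keeps $V_n(d) = 0$ at every step. The dictionary layout and the function name `bellman_ford` are choices made for this illustration.

```python
import math

# Mean delays c(i, j) = E[T(i, j)] for the example graph.
c = {
    ("s", "a"): 9.0,   # mean of U[5, 13]
    ("a", "d"): 10.0,
    ("a", "b"): 6.0,   # mean of U[2, 10]
    ("b", "d"): 4.0,
    ("s", "d"): 20.0,
}
nodes = ["s", "a", "b", "d"]

def bellman_ford(c, nodes, dest):
    """Iterate V_{n+1}(i) = min_j {c(i, j) + V_n(j)} until nothing changes."""
    V = {i: math.inf for i in nodes}
    V[dest] = 0.0
    for _ in range(len(nodes)):  # enough iterations for a graph with this many nodes
        V_next = {dest: 0.0}
        for i in nodes:
            if i == dest:
                continue
            V_next[i] = min((cost + V[j] for (u, j), cost in c.items() if u == i),
                            default=math.inf)
        if V_next == V:
            break
        V = V_next
    return V

V = bellman_ford(c, nodes, "d")
```

The result is $V(b) = 4$, $V(a) = 10$, and $V(s) = 19$: the pre-planned route $sad$ (or $sabd$, which has the same mean) beats the direct edge $sd$ by one unit of time.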


13.3 Formulation 2: Adapting 


We now assume that when we get to a node i, we see the actual travel times along 
the edges out of i. However, we do not see beyond those edges. How should we 
modify our path planning? If the travel times are in fact deterministic, then nothing 
changes. However, if they are random, we may notice that the actual travel times on 
some edges out of i are smaller than their mean value, whereas others may be larger. 
Clearly, we should use that information. 
Here is a systematic procedure for calculating the best path. Let $V(i)$ be the
minimum average time to get to $d$ starting from node $i$, for $i \in \{s, a, b, d\}$. We see
that $V(b) = T(b, d) = 4$.

To calculate $V(a)$, define $W(a)$ to be the minimum expected time from $a$ to $d$
given the observed delays along the edges out of $a$. That is,

$$W(a) = \min\{T(a, b) + V(b), T(a, d)\}.$$

Hence, $V(a) = E(W(a))$. Thus,

$$V(a) = E(\min\{T(a, b) + V(b), T(a, d)\}). \tag{13.3}$$

For this example, we see that $T(a, b) + V(b) =_D U[6, 14]$. Since $T(a, d) = 10$, if
$T(a, b) + V(b) < 10$, which occurs with probability 1/2, we choose the path $abd$,
whose expected travel time is then uniformly distributed in [6, 10] with a mean value 8. Also,
if $T(a, b) + V(b) > 10$, which also occurs with probability 1/2, we choose the edge $ad$
with travel time $T(a, d) = 10$. Thus, $W(a)$ has a mean value of 8 on an event of
probability 1/2 and is equal to 10 on an event of probability 1/2, so that its average value is
$8(1/2) + 10(1/2) = 9$. Hence, $V(a) = 9$.
Similarly,

$$V(s) = E(\min\{T(s, a) + V(a), T(s, d)\}),$$

where $T(s, a) + V(a) =_D U[14, 22]$ and $T(s, d) = 20$. Thus, if $T(s, a) + V(a) <
20$, which occurs with probability $(20 - 14)/(22 - 14) = 3/4$, then we choose a
path that goes from $s$ to $a$ and has an expected delay that is uniformly distributed in [14, 20],
with mean value 17. If $T(s, a) + V(a) > 20$, which occurs with probability 1/4, we
choose the direct path $sd$ that has delay 20. Hence $V(s) = 17(3/4) + 20(1/4) =
71/4 = 17.75$.

Note that by observing the delays on the next edges and making the appropriate
decisions, we reduce the expected travel time from $s$ to $d$ from 19 to 17.75. Not
surprisingly, more information helps. Observe also that the decisions we make
depend on the observed delays. For instance, starting in node $s$, we go along edge $sd$
if $T(s, a) + V(a) > T(s, d)$, i.e., if $T(s, a) + 9 > 20$, or $T(s, a) > 11$. Otherwise,
we follow the edge $sa$.
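These two expectations can be reproduced exactly with a closed form for $E[\min(U, c)]$ when $U$ is uniform and $c$ is a constant. The sketch below uses exact fractions. The helper name `mean_min_uniform_const` is an illustrative choice, and the approach relies on the fact that, in this example, each node has one random and one deterministic alternative.

```python
from fractions import Fraction

def mean_min_uniform_const(lo, hi, const):
    """E[min(U, const)] for U uniform on [lo, hi]."""
    lo, hi, const = Fraction(lo), Fraction(hi), Fraction(const)
    if const <= lo:
        return const
    if const >= hi:
        return (lo + hi) / 2
    p_below = (const - lo) / (hi - lo)   # P(U < const)
    # Below const, U is uniform on [lo, const]; otherwise the minimum is const.
    return p_below * (lo + const) / 2 + (1 - p_below) * const

V_b = Fraction(4)                        # T(b, d) = 4
# At a: compare T(a, b) + V(b), uniform on [2 + 4, 10 + 4], with T(a, d) = 10.
V_a = mean_min_uniform_const(2 + V_b, 10 + V_b, 10)
# At s: compare T(s, a) + V(a), uniform on [5 + V(a), 13 + V(a)], with T(s, d) = 20.
V_s = mean_min_uniform_const(5 + V_a, 13 + V_a, 20)
```

This gives `V_a = 9` and `V_s = 71/4`, that is 17.75, compared with 19 when the route is planned ahead of time.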

Let us now go back to the general model. The key relationships are as follows: 


$$V(i) = E\left(\min_j \{T(i, j) + V(j)\}\right), \; \forall i. \tag{13.4}$$


The interpretation is simple: starting from i, one can choose to go next to j. In that 
case, one faces a travel time T (i, j) from i to j and a subsequent minimum average 
time from j to d equal to V (j). Since the path from i to d must necessarily go to a 
next node j, the minimum expected travel time from i to d is given by the expression 
above. As before, these equations are justified by the fact that the expected value of
a sum is the sum of the expected values.

An algorithm for solving these fixed-point equations is

$$V_{n+1}(i) = E\left(\min_j \{T(i, j) + V_n(j)\}\right), \quad n \geq 0, \tag{13.5}$$

where $V_0(i) = 0$ for all $i$. The interpretation of $V_n(i)$ is the same as before: it is the
minimum expected time from $i$ to $d$ using a path with at most $n$ edges, given that at
each step along the path one observes the delays along the edges out of the current
node.

Equations (13.4) are the stochastic dynamic programming equations for the 
problem. Equations (13.5) are called the value iteration equations. 


13.4 Markov Decision Problem 


A more general version of the path planning problem is the control of a Markov 
chain. At each step, one looks at the state and one chooses an action that determines 
the transition probabilities and also the cost for the next step. 

More precisely, to define a controlled Markov chain $X(n)$ on some state space
$\mathcal{X}$, one specifies, for each $x \in \mathcal{X}$, a set $A(x)$ of possible actions. For each state
$x \in \mathcal{X}$ and each action $a \in A(x)$, one has transition probabilities $P(x, x'; a) \geq 0$
with $\sum_{x' \in \mathcal{X}} P(x, x'; a) = 1$. One also specifies a cost $c(x, a)$ of taking the action
$a$ when in state $x$.

The sequence $X(n)$ is then defined by

$$P[X(1) = x_1, X(2) = x_2, \ldots, X(n) = x_n \mid X(0) = x_0, a_0, \ldots, a_{n-1}]
= P(x_0, x_1; a_0) P(x_1, x_2; a_1) \times \cdots \times P(x_{n-1}, x_n; a_{n-1}).$$

The goal is to choose the actions to minimize the average total cost

$$E\left[\sum_{m=0}^{n} c(X(m), a(m)) \,\Big|\, X(0) = x\right]. \tag{13.6}$$

For each $m = 0, \ldots, n$, the action $a(m) \in A(X(m))$ is determined from the
knowledge of $X(m)$ and also of the previous states $X(0), \ldots, X(m - 1)$ and previous
actions $a(0), \ldots, a(m - 1)$.


This problem is called a Markov decision problem (MDP). 

To solve this problem, we follow a procedure identical to the path planning
problem where we think of the state as the node that has been reached during the
travel. Let $V_m(x)$ be the minimum value of the cost (13.6) when $n$ is replaced by $m$.
That is, $V_m(x)$ is the minimum average cost of the next $m + 1$ steps, starting from
$X(0) = x$. The function $V_m(\cdot)$ is called the value function.

The DPE are

$$V_m(x) = \min_{a \in A(x)} \left\{c(x, a) + E[V_{m-1}(X(1)) \mid X(0) = x, a(0) = a]\right\}
= \min_{a \in A(x)} \left\{c(x, a) + \sum_y P(x, y; a) V_{m-1}(y)\right\}. \tag{13.7}$$

Let $a = g_m(x)$ be the value of $a \in A(x)$ that achieves the minimum in (13.7). Then
the choices $a(m) = g_{n-m}(X(m))$ achieve the minimum of (13.6).
The existence of the minimizing $a$ in (13.7) is clear if $\mathcal{X}$ and each $A(x)$ are finite
and also under weaker assumptions.
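For finite state and action sets, the recursion (13.7) is a short backward induction. The sketch below is one possible implementation, not the book's code. It assumes the convention $V_{-1}(x) = 0$, stores the model in dictionaries, and uses a hypothetical two-state example invented for the illustration.

```python
def finite_horizon_dp(states, actions, P, c, n):
    """Backward induction for (13.7).

    states  : list of states
    actions : dict mapping a state x to its list A(x) of actions
    P       : dict mapping (x, a) to a dict {y: P(x, y; a)}
    c       : dict mapping (x, a) to the cost c(x, a)
    n       : horizon; the cost (13.6) has n + 1 terms
    Returns (V, g) with V[m][x] = V_m(x) and g[m][x] a minimizing action.
    """
    V_prev = {x: 0.0 for x in states}   # convention V_{-1}(x) = 0 (assumed)
    V, g = [], []
    for m in range(n + 1):
        V_m, g_m = {}, {}
        for x in states:
            best_a, best_val = None, float("inf")
            for a in actions[x]:
                val = c[(x, a)] + sum(p * V_prev[y] for y, p in P[(x, a)].items())
                if val < best_val:
                    best_a, best_val = a, val
            V_m[x], g_m[x] = best_val, best_a
        V.append(V_m)
        g.append(g_m)
        V_prev = V_m
    return V, g

# Hypothetical example: in state 0, "stay" costs 1 per step, while "move"
# costs 3 once and leads to the cost-free absorbing state 1.
states = [0, 1]
actions = {0: ["stay", "move"], 1: ["stay"]}
P = {(0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0}, (1, "stay"): {1: 1.0}}
c = {(0, "stay"): 1.0, (0, "move"): 3.0, (1, "stay"): 0.0}
V, g = finite_horizon_dp(states, actions, P, c, n=5)
```

In this example the best action in state 0 depends on the time to go: with few steps left it is cheaper to stay, and with many steps left it is cheaper to pay once and move. At time $m$ one applies `g[n - m]` to the current state, as in the text.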


13.4.1 Examples 


Guess a Card 
Here is a simple example. You are given a perfectly shuffled deck of 52 cards. The
cards are turned over one at a time. Before a new card is turned over, you have the
option of saying "Stop." If you say "Stop" and the next card is an ace, you win \$1.00. If not, the game
stops and you lose. The problem is for you to decide when to stop (Fig. 13.4).
Assume that there are still $x$ aces in a deck with $m$ remaining cards. Then, if you
say stop, you win with probability $x/m$. If you do not say stop, then after the next
card is turned over, $x - 1$ aces remain with probability $x/m$ and $x$ remain otherwise.
Let $V(m, x)$ be the maximum probability that you win if there are still
$x$ aces in the deck with $m$ remaining cards.

The DPE are

$$V(m, x) = \max\left\{\frac{x}{m}, \; \frac{x}{m} V(m - 1, x - 1) + \frac{m - x}{m} V(m - 1, x)\right\}.$$

Interestingly, the solution of these equations is $V(m, x) = x/m$, as you can
verify. Also, the two terms in the maximum are equal if $x > 0$. The conclusion is
that you can stop at any time, as long as there is still at least one ace in the deck.
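The claim that $V(m, x) = x/m$ can be verified mechanically. The sketch below solves the DPE by recursion with exact fractions, assuming the boundary condition that with one card left the winning probability is $x$ (an assumption made here, since the text does not spell out the last step).

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def V(m, x):
    """Maximum probability of winning with x aces among m remaining cards."""
    stop = Fraction(x, m)              # say "Stop" now
    if m == 1:
        return stop                    # last card: nothing left to wait for
    cont = Fraction(0)
    if x > 0:                          # the next card is an ace
        cont += Fraction(x, m) * V(m - 1, x - 1)
    if x < m:                          # the next card is not an ace
        cont += Fraction(m - x, m) * V(m - 1, x)
    return max(stop, cont)

# Compare with x/m for every deck size up to 52 and up to four aces.
all_match = all(V(m, x) == Fraction(x, m)
                for m in range(1, 53) for x in range(0, min(4, m) + 1))
```

Here `all_match` is `True`, and in particular `V(52, 4)` equals 1/13: no stopping rule beats stopping at the very first card.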


Fig. 13.4 Guessing if the next card is an Ace


Scheduling Jobs

You have two sets of jobs to perform. Jobs of type $i$ (for $i = 1, 2$) have a waiting
cost equal to $c_i$ per unit of waiting time until they are completed. Also, when you
work on a job of type $i$, it completes with probability $\mu_i$ in the next time unit,
independently of how long you have worked on it. That is, the job processing times
are geometrically distributed with parameter $\mu_i$. The problem is to decide which job
to work on to minimize the total waiting cost of the jobs.

Let $V(x_1, x_2)$ be the minimum expected total remaining waiting cost given that
there are $x_1$ jobs of type 1 and $x_2$ jobs of type 2. The DPE are

$$V(x_1, x_2) = x_1 c_1 + x_2 c_2 + \min\{V_1(x_1, x_2), V_2(x_1, x_2)\},$$

where

$$V_1(x_1, x_2) = \mu_1 V((x_1 - 1)^+, x_2) + (1 - \mu_1) V(x_1, x_2)$$

and

$$V_2(x_1, x_2) = \mu_2 V(x_1, (x_2 - 1)^+) + (1 - \mu_2) V(x_1, x_2).$$

As can be verified directly, the solution of the DPE is as follows. Assume that
$c_1 \mu_1 > c_2 \mu_2$. Then

$$V(x_1, x_2) = c_1 \frac{x_1 (x_1 + 1)}{2 \mu_1} + c_2 \frac{x_2 (x_2 + 1)}{2 \mu_2} + c_2 \frac{x_1 x_2}{\mu_1}.$$


Moreover, this minimum expected cost is achieved by performing all the jobs of type
1 first and then the jobs of type 2. This strategy is called the $c\mu$ rule. Thus, although
one might be tempted to work on the longest queue first, this is not optimal.

There is a simple interchange argument to confirm the optimality of the $c\mu$ rule.
Say that you decide to work on the jobs in the following order: 1221211. Thus, you
work on a job of type 1 until it completes, then a job of type 2, then another job of
type 2, and so on. Modify the strategy as follows. Instead of working on the second
job of type 2, work on the second job of type 1, until it completes. Then work on the
second job of type 2 and continue as you would have. Thus, the processings of two
jobs have been interchanged: the second job of type 2 and the second job of type
1. Only the waiting times of these two jobs change. The waiting time of the job of
type 1 is reduced by $1/\mu_2$, on average, since this is the average completion time of
the job of type 2 that was previously processed before the job of type 1. Thus, the
waiting cost of the job of type 1 is reduced by $c_1/\mu_2$. Similarly, the waiting cost of
the job of type 2 is increased by $c_2/\mu_1$, on average. Thus, the average cost decreases
by $c_1/\mu_2 - c_2/\mu_1$, which is a positive amount since $c_1 \mu_1 > c_2 \mu_2$. By induction, it
is optimal to process all the jobs of type 1 first.
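The DPE for this example can also be solved numerically and compared with the closed form $c_1 x_1 (x_1 + 1)/(2\mu_1) + c_2 x_2 (x_2 + 1)/(2\mu_2) + c_2 x_1 x_2/\mu_1$. The sketch below makes two choices that are not in the text: the numerical values of $c_i$ and $\mu_i$, and the rearrangement of each branch of the DPE so that $V(x_1, x_2)$ appears only on the left. Working on an empty queue is excluded because it only delays the other jobs.

```python
from fractions import Fraction
from functools import lru_cache

c1, c2 = Fraction(3), Fraction(1)            # waiting costs (values assumed)
mu1, mu2 = Fraction(1, 2), Fraction(1, 3)    # completion probabilities (assumed)
assert c1 * mu1 > c2 * mu2                   # so type 1 should be served first

@lru_cache(maxsize=None)
def V(x1, x2):
    """Solution of the DPE, each branch being solved for V(x1, x2)."""
    if x1 == 0 and x2 == 0:
        return Fraction(0)
    cost = x1 * c1 + x2 * c2
    options = []
    if x1 > 0:
        # V = cost + mu1 V(x1-1, x2) + (1 - mu1) V  =>  V = cost/mu1 + V(x1-1, x2)
        options.append(cost / mu1 + V(x1 - 1, x2))
    if x2 > 0:
        # Same rearrangement when working on a job of type 2.
        options.append(cost / mu2 + V(x1, x2 - 1))
    return min(options)

def closed_form(x1, x2):
    return (c1 * x1 * (x1 + 1) / (2 * mu1)
            + c2 * x2 * (x2 + 1) / (2 * mu2)
            + c2 * x1 * x2 / mu1)

agrees = all(V(x1, x2) == closed_form(x1, x2)
             for x1 in range(8) for x2 in range(8))
```

With these values, `agrees` is `True`. For instance, `V(1, 1)` is 11 when the job of type 1 goes first, whereas serving the job of type 2 first would cost 18.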

Of course, there are very few examples of control problems where the optimal
policy can be proved by a simple argument. Nevertheless, keep this possibility
in mind because it can yield elegant results simply. For instance, assume that
jobs arrive at the queues shown in Fig. 13.5 according to independent Bernoulli
processes. That is, with probability $\lambda_i$, a job of type $i$ arrives during each time step,
independently of the past, for $i = 1, 2$. The same interchange argument shows that
the $c\mu$ rule minimizes the long-term average expected waiting cost of the jobs (a
cost that we have not defined, but you may be able to imagine what it means).
This is useful because the DPE can no longer be solved explicitly and proving the
optimality of this rule analytically is quite complicated.

Fig. 13.5 What job to work on next?


Hiring a Helper 

Jobs arrive at random times and you must decide whether to work on them yourself 
or hire some helper. Intuition suggests that you should get some help if the backlog 
of jobs to be performed exceeds some threshold. We examine a model of this 
situation. 

At time $n = 0, 1, \ldots$, a job arrives with probability $\lambda \in (0, 1)$. If you work alone,
you complete a job with probability $\mu \in (0, 1)$ in one time unit, independently of
the past. If you hire a helper, then together you complete a job with probability
$\alpha\mu \in (0, 1)$ in one unit of time, where $\alpha > 1$. Let the cost at time $n$ be $c(n) = \beta > 0$
if you hire a helper at time step $n$ and $c(n) = 0$ otherwise. The goal is to minimize

$$E\left[\sum_{n=0}^{N} (X(n) + c(n))\right],$$


where X (n) is the number of jobs yet to be processed at time n. This cost measures 
the waiting cost of the jobs plus the cost of hiring the helper. The waiting cost is 
minimized if you hire the helper all the time and the helper cost is minimized if you 
never hire him. The goal of the problem is to figure out when to hire a helper to 
achieve the best trade-off between these two costs. 

The state of the system is $X(n)$ at time $n$. Let

$$V_m(x) = \min E\left[\sum_{n=0}^{m} (X(n) + c(n)) \,\Big|\, X(0) = x\right],$$

where the minimum is over the possible choices of actions (hiring or not) that
depend on the state up to that time. The stochastic dynamic programming equations
are

$$V_m(x) = x + \min_{a \in \{0, 1\}} \{\beta 1\{a = 1\} + (1 - \lambda)(1 - \mu(a)) V_{m-1}(x)
+ \lambda(1 - \mu(a)) V_{m-1}(\min\{x + 1, K\})
+ (1 - \lambda)\mu(a) V_{m-1}(\max\{x - 1, 0\})
+ \lambda\mu(a) V_{m-1}(x)\}, \quad m \geq 0,$$

where we defined $\mu(0) = \mu$ and $\mu(1) = \alpha\mu$ and $V_{-1}(x) = 0$. Also, we limit the
backlog of jobs to $K$, so that if one job arrives when there are already $K$, we discard
the new arrival.

We solve these equations using Python. As expected, the solution shows that one
should hire a helper at time $n$ if $X(n) > \gamma(N - n)$, where $\gamma(m)$ is a constant that
decreases with $m$. As the time to go $m$ increases, the cost of holding extra jobs
increases and so does the incentive to hire a helper. Figure 13.6 shows the values of
$\gamma(n)$ for $\beta = 14$ and $\beta = 20$. The figure corresponds to $\lambda = 0.5$, $\mu = 0.6$, $\alpha =
1.5$, $K = 20$, and $N = 200$. Not surprisingly, when the helper is more expensive,
one waits until the backlog is larger before hiring him.
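A possible Python implementation of this backward induction is sketched below. It is not the book's code. It assumes $V_{-1}(x) = 0$ and a backlog capped at $K$, and the list `gamma` records, for each time to go $m$, the smallest backlog at which hiring is strictly cheaper (a slightly different convention from the threshold $\gamma(m)$ used in the text).

```python
def helper_thresholds(lam, mu, alpha, beta, K, N):
    """Solve the DPE by backward induction.

    Returns (V, gamma) where V[x] = V_N(x) and gamma[m] is the smallest backlog x
    at which hiring is strictly better with m steps to go (None if it never is).
    """
    rate = {0: mu, 1: alpha * mu}       # mu(a) for a = 0 (alone), a = 1 (helper)
    V_prev = [0.0] * (K + 1)            # V_{-1}(x) = 0
    gamma = []
    for m in range(N + 1):
        V_m, hire = [], []
        for x in range(K + 1):
            value = {}
            for a in (0, 1):
                r = rate[a]
                up, down = V_prev[min(x + 1, K)], V_prev[max(x - 1, 0)]
                value[a] = (beta * a
                            + (1 - lam) * (1 - r) * V_prev[x]   # no arrival, no completion
                            + lam * (1 - r) * up                # arrival only
                            + (1 - lam) * r * down              # completion only
                            + lam * r * V_prev[x])              # arrival and completion
            V_m.append(x + min(value.values()))
            hire.append(value[1] < value[0])
        gamma.append(next((x for x in range(K + 1) if hire[x]), None))
        V_prev = V_m
    return V_prev, gamma

V14, gamma14 = helper_thresholds(lam=0.5, mu=0.6, alpha=1.5, beta=14, K=20, N=200)
V20, gamma20 = helper_thresholds(lam=0.5, mu=0.6, alpha=1.5, beta=20, K=20, N=200)
```

Plotting `gamma14` and `gamma20` against the time to go allows a comparison with the qualitative behavior described above. With no step left after the current one (`m = 0`), hiring never pays, so `gamma[0]` is `None`.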


Which Queue to Join? 

After shopping in the supermarket, you get to the cashiers and have to choose a 
queue to join. Naturally, you try to identify the queue with the shortest expected 
waiting time, and you join that queue. Everyone does the same, and it seems quite 
natural that this strategy should minimize the expected waiting time of all the 
customers. Your friend, who has taken this class before, tells you that this is not 
necessarily the case. Let us try to understand this apparent paradox. 

Assume that there are two queues and customers arrive with probability $\lambda$ at each
time step. The service times in queue $i$ are geometrically distributed with parameter
$\mu_i$, for $i = 1, 2$.

Say that when you arrive, there are $x_i$ customers in queue $i$, for $i = 1, 2$. You
should join queue 1 if

$$\frac{x_1 + 1}{\mu_1} < \frac{x_2 + 1}{\mu_2},$$

as this will minimize the expected time until you are served. However, if we consider
the problem of minimizing the total average waiting time of customers in the two
queues, we find that the optimal policy does not agree with the selfish choice of
individual customers. Figure 13.7 shows an example with $\mu_2 < \mu_1$. It indicates that
under the socially optimal policy some customers should join queue 2, even though
they will then incur a longer delay than under the selfish policy.

Fig. 13.7 The socially optimal policy is shown in blue and the selfish policy is shown in green

This example corresponds to minimizing the total cost

$$E\left[\sum_{n=0}^{N} \beta^n (X_1(n) + X_2(n))\right].$$

In this expression, $X_i(n)$ is the number of customers in queue $i$ at time $n$. The
capacity of each queue is $K$. To prevent the system from discarding too many
customers, one imposes the constraint that if only one queue is full when a customer
arrives, he should join the non-full queue. In the expression for the total cost, one
uses a discount factor $\beta \in (0, 1)$ to keep the cost bounded. The figure corresponds
to $K = 8$, $\lambda = 0.3$, $\mu_1 = 0.4$, $\mu_2 = 0.2$, $N = 100$, and $\beta = 0.95$. (The graphs are
in fact for $x_1 + 1$ and $x_2 + 1$ as Python does not like the index value 0.)
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13.5 Infinite Horizon 


The problem of minimizing (13.6) involves a finite horizon. The problem stops at 
time n. We have seen that the minimum cost to go when there are m more steps is 
Vin(x) when in state x. Thus, not surprisingly, the cost to go depends on the time to 
go and, consequently, the best action to choose in a given state x generally depends 
on the time to go. 

The problem is simpler when one considers an infinite horizon because the time 
to go remains the same at each step. To make the total cost finite, one discounts the 
future costs. That is, one considers the problem of minimizing the expected total 
discounted cost: 


$$E\left[\sum_{m=0}^{\infty} \beta^m c(X(m), a(m)) \,\Big|\, X(0) = x\right]. \tag{13.8}$$


In this expression, $0 < \beta < 1$ is the discount rate. Intuitively, if $\beta$ is small, then
future costs do not matter much and one tends to be short-sighted. However, if $\beta$ is
close to 1, then one pays a lot of attention to the long term.

Define V (x) to be the minimum value of the cost (13.8), where the minimum is 
over all the possible choices of the actions at each step. Arguing as before, one can 
show that 


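$$\begin{aligned}
V(x) &= \min_a \left\{ c(x, a) + \beta E[V(X(1)) \mid X(0) = x, a(0) = a] \right\} \\
     &= \min_a \left\{ c(x, a) + \beta \sum_y P(x, y; a) V(y) \right\}.
\end{aligned} \tag{13.9}$$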


These equations are similar to (13.7), with two differences: the discount factor 
and the fact that the value function does not depend on time. Note that these 
equations are fixed-point equations. A standard method to solve them is to consider 
the equations 


$$V_{n+1}(x) = \min_a \left\{ c(x, a) + \beta \sum_y P(x, y; a) V_n(y) \right\}, \quad x \in \mathcal{X},\ n \geq 0, \tag{13.10}$$

where one chooses $V_0(x) = 0, \forall x$. Note that these equations correspond to

$$V_n(x) = \min E\left[\sum_{m=0}^{n-1} \beta^m c(X(m), a(m)) \,\Big|\, X(0) = x\right]. \tag{13.11}$$


One can show that the solution $V_n(x)$ of (13.10) is such that $V_n(x) \to V(x)$ as
$n \to \infty$, where $V(x)$ is the solution of (13.9).
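The iteration (13.10) is straightforward to code. The following is a minimal sketch, assuming a finite state space and a transition array indexed as `P[a, x, y]`; the function name `value_iteration` and the two-state toy example are illustrative assumptions, not taken from the text.

```python
import numpy as np


def value_iteration(P, c, beta, n_iter=500):
    """Iterate (13.10) starting from V_0 = 0.

    P[a, x, y] : transition probability P(x, y; a)
    c[x, a]    : one-step cost c(x, a)
    Returns the approximate value function V and a minimizing action per state.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(n_iter):
        # Q[x, a] = c(x, a) + beta * sum_y P(x, y; a) V(y)
        Q = c + beta * np.einsum("axy,y->xa", P, V)
        V = Q.min(axis=1)
    Q = c + beta * np.einsum("axy,y->xa", P, V)
    return V, Q.argmin(axis=1)


# Toy example (hypothetical): state 1 costs 1 per step, state 0 costs nothing.
# Action 0 keeps the state; action 1 costs an extra 0.5 and moves to state 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay
              [[1.0, 0.0], [1.0, 0.0]]])    # action 1: reset to state 0
c = np.array([[0.0, 0.5],
              [1.0, 1.5]])
V, policy = value_iteration(P, c, beta=0.9)
```

In this toy example, staying in state 1 forever would cost $1/(1 - 0.9) = 10$, whereas resetting costs 1.5, so the iteration settles on $V = (0, 1.5)$ with the reset action in state 1.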
13.6 Summary


* Dynamic Programming Equations; 
* Controlled Markov Chain; 
* Markov Decision Problem. 


13.6.1 Key Equations and Formulas 


MDP     $P(x, y; a)$                                                       S.13.4
SDPE    $V_{m+1}(x) = \min_a \{c(x, a) + \sum_y P(x, y; a) V_m(y)\}$       (13.7)


13.7 References 


The book Ross (1995) is a splendid introduction to stochastic dynamic programming.
We borrowed the “guess a card” example from it. It explains the key
ideas simply and illustrates the many variations of the theory with carefully chosen
examples. The textbook Bertsekas (2005) is a comprehensive presentation of the
algorithms for dynamic programming. It contains many examples and detailed 
discussions of the theory and practice. 


13.8 Problems 


Problem 13.1 Consider a single queue with one server in discrete time. At each
time, a new customer arrives to the queue with probability $\lambda < 1$, and if the server
works on the queue at rate $\mu \in [0, 1]$, it serves one customer in one unit of time
with probability $\mu$. Due to energy constraints, you want your server to work at
the smallest rate possible without making the queue unstable. Thus, you want
your server to work at rate $\mu^* = \lambda$. Unfortunately, you do not know the value of
$\lambda$. All you can observe is the queue length. We try to design an algorithm based on
stochastic gradient to learn $\mu^*$ in the following steps:


(a) Minimize the function $V(\mu) = \frac{1}{2}(\lambda - \mu)^2$ over $\mu$ using gradient descent.

(b) Find $E[Q(n + 1) - Q(n) \mid Q(n) = q]$, for some $q > 0$, given that the server allocates
capacity $\mu_n$ during time slot $n$. $Q(n)$ is the queue length at time $n$. What happens
if $q = 0$?

(c) Use the stochastic gradient projection algorithm and write Python code based
on parts (a) and (b) to learn $\mu^*$. Note that $0 \leq \mu \leq 1$.

Hint: To avoid the case when the queue length is 0, start with a large initial queue
length.


Problem 13.2 Consider a routing network with three nodes: the start node s, the 
destination node d, and an intermediate node r. There is a direct path from s to d 
with travel time 20. The travel time from s to r is 7. There are two paths from r to 
d. They have independent travel times that are uniformly distributed between 8 and 
20. 


(a) If you want to do pre-planning, which path should be chosen to go from s to d? 
(b) If the travel times from r to d are revealed at r, which path should be chosen?


Problem 13.3 Consider a single queue in discrete time with a Bernoulli arrival
process of rate $\lambda$. The queue can hold $K$ jobs, and there is a fee $\gamma$ when its
backlog reaches $K$. There is one server dedicated to the queue with service rate
$\mu(0)$. You can decide to allocate another server to the queue that increases the rate
to $\mu(1) \in (\mu(0), 1)$. However, using the additional server has some cost. You want
to minimize the cost

$$\sum_{n=0}^{\infty} \beta^n E\left(X(n) + \alpha H(n) + \gamma 1\{X(n) = K\}\right),$$

where $H(n)$ is equal to one if you use an extra helper at time $n$ and is zero otherwise.


(a) Write the dynamic programming equations.
(b) Solve the DPE with MATLAB for $\lambda = 0.4$, $\mu(0) = 0.35$, $\mu(1) = 0.5$, $\alpha = 2.5$,
$\beta = 0.95$, and $\gamma = 30$.


Problem 13.4 We want to plan routing from node 1 to 5 in the graph of Fig. 13.8.
The travel times on the edges of the graph are as follows: $T(1, 2) = 2$, $T(1, 3) \sim U[2, 4]$,
$T(2, 4) = 1$, $T(2, 5) \sim U[4, 6]$, $T(4, 5) \sim U[3, 5]$, and $T(3, 5) = 4$. Note
that $X \sim U[a, b]$ means $X$ is a random variable uniformly distributed between $a$
and $b$.


(a) If you want to do pre-planning, which path would you choose? What is the
expected travel time?

(b) Now suppose that at each node, the travel times of two steps ahead are revealed.
Thus, at node 1 all the travel times are revealed except $T(4, 5)$. Write the
dynamic programming equations that solve the route planning problem and
solve them. That is, let $V(i)$ be the minimum expected travel time from $i$ to
5, for $1 \leq i \leq 5$. Find $V(i)$ for $1 \leq i \leq 5$.


Fig. 13.8 Route planning


Problem 13.5 Consider a factory, DilBox, that stores boxes. At the beginning of
year $k$, they have $x_k$ boxes in storage. At the end of every year $k$ they are
mandated by contracts to provide $d_k$ boxes. However, the number of boxes $d_k$ is
unknown until the year actually ends.

At the beginning of the year, they can request $u_k$ boxes. Using very shoddy
Elbonian labor, each box costs $A$ to produce. At the end of the year DilBox
is able to borrow $y_k$ boxes from BoxR’Us at the cost $s(y_k)$ to meet the contract.

The boxes remaining after meeting the demand are carried over to the next year:
$x_{k+1} = x_k + u_k + y_k - d_k$. Sadly, they need to pay to store the boxes at a cost given
by a function $r(x_{k+1})$.

Now your job is to provide a box creation and storage plan for the upcoming 20
years. Your goal is to minimize the total cost for the 20 years. You can treat costs
as being paid at the end of the year and there is no inflation. Also, you get your
pension after 20 years so you do not care about costs beyond those paid in the 20th
year. (Assume you start with zero boxes; of course, it does not really matter.)


(a) Formulate the problem as a Markov decision problem;
(b) Write the dynamic programming equations;
(c) Use Python to solve the equations with the following parameters:


- $r(x_k) = 5x_k$;
- $s(y_k) = 20y_k$;
- $A = 1$;
- $d_k =_D U\{1, \ldots, 10\}$.


Problem 13.6 Consider a video game duel where Bob starts at time 0 at distance
$T = 10$ from Alice and gets closer to her at speed 1. For instance, Alice is at location
$(0, 0)$ in the plane and Bob starts at location $(0, T)$ and moves toward Alice, so that
after $t$ seconds, Bob is at location $(0, T - t)$. Alice has picked a random time,
uniformly distributed in $[0, T]$, when she will shoot Bob. If Alice shoots first, Bob
is dead. Alice never misses. [This is only a video game.]


(a) Bob has to find at what time $t$ he should shoot Alice to maximize the probability
of killing her. If Bob shoots from a distance $x$, the probability that he hits (and
kills) Alice is $1/(1 + x)^2$. Bob has only one bullet.

(b) What is the maximum probability that Bob wins the duel?

(c) Assume now that Bob has two bullets. You must find the times $t_1$ and $t_2$ when
Bob should shoot Alice to maximize the probability that he wins the duel. Again,
for each bullet that Bob shoots from distance $x$, the probability of success is
$1/(1 + x)^2$, independently for each bullet.


Problem 13.7 You play a game where you win the amount you bet with probability
$p \in (0, 0.5)$ and you lose it with probability $1 - p$. Your initial fortune is 16 and
you gamble a fixed amount $y$ at each step, where $y \in \{1, 2, 4, 8, 16\}$. Find the
probability that you reach a fortune equal to 256 before you go broke. What is the
gambling amount that maximizes that probability?


Topics: LQG Control, incomplete observations


14.1 LQG Control 


The ideas of dynamic programming that we explained for a controlled Markov 
chain apply to other controlled systems. We discuss the case of a linear system with 
quadratic cost and Gaussian noise, which is called the LQG problem. For simplicity, 
we consider only the scalar case. 

The system is 


$$X(n + 1) = aX(n) + U(n) + V(n), \quad n \geq 0. \tag{14.1}$$


Here, $X(n)$ is the state, $U(n)$ is a control value, and $V(n)$ is the noise. We assume
that the random variables $V(n)$ are i.i.d. and $\mathcal{N}(0, \sigma^2)$.

The problem is to choose, at each time $n$, the control value $U(n)$ in $\Re$ based on
the observed state values up to time $n$ to minimize the expected cost

$$E\left[\sum_{n=0}^{N} \left(X(n)^2 + \beta U(n)^2\right) \Big|\, X(0) = x\right]. \tag{14.2}$$


Thus, the goal of the control is to keep the state value close to zero, and one pays a 
cost for the control. 

The problem is then to trade off the cost of a large value of the state against that of
the control that can bring the state back close to zero. To get some intuition for the
solution, consider a simple form of this trade-off: minimizing

$$(ax + u)^2 + \beta u^2.$$

In this simple version of the problem, there is no noise and we apply the control
only once. To minimize this expression over $u$, we set the derivative with respect to
$u$ equal to zero and we find

$$2(ax + u) + 2\beta u = 0,$$

so that

$$u = -\frac{a}{1 + \beta}\, x.$$

Thus, the value of the control that minimizes the cost is linear in the state. We should
use a large control value when the state is far from the desired value 0. The following
result shows that the same conclusion holds for our problem (Fig. 14.1).

Fig. 14.1 The optimal control is linear in the state $X(n)$ (block diagram with gain $g(N - n)$)


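Theorem 14.1 (Optimal LQG Control) The control values $U(n)$ that minimize
(14.2) for the system (14.1) are

$$U(n) = g(N - n) X(n),$$

where

$$g(m) = -\frac{a\, d(m - 1)}{\beta + d(m - 1)}, \quad m \geq 0; \tag{14.3}$$

$$d(m) = 1 + \frac{a^2 \beta\, d(m - 1)}{\beta + d(m - 1)}, \quad m \geq 0, \tag{14.4}$$

with $d(-1) = 0$.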


That is, the optimal control is linear in the state and the coefficient depends on 
the time-to-go. These coefficients can be pre-computed at time 0 and they do not 
depend on the noise variance. Thus, the control values would be calculated in the 
same way if V (n) = 0 for all n. 
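As a minimal sketch of how these coefficients might be pre-computed, the function below iterates the recursions for $d(m)$ and $g(m)$ of Theorem 14.1 starting from $d(-1) = 0$, and a second function simulates the controlled system. The names `lqg_gains` and `simulate`, and the numerical values $a = 0.8$, $\beta = 1$, $\sigma = 0.5$, are illustrative assumptions, not taken from the text; $\beta > 0$ is assumed.

```python
import numpy as np


def lqg_gains(a, beta, N):
    """Return (d, g) with d[m] = d(m) and g[m] = g(m) for m = 0..N.

    g(m) = -a d(m-1) / (beta + d(m-1)),
    d(m) = 1 + a^2 beta d(m-1) / (beta + d(m-1)),  with d(-1) = 0.
    """
    d = np.zeros(N + 1)
    g = np.zeros(N + 1)
    d_prev = 0.0  # d(-1) = 0
    for m in range(N + 1):
        g[m] = -a * d_prev / (beta + d_prev)
        d[m] = 1.0 + a**2 * beta * d_prev / (beta + d_prev)
        d_prev = d[m]
    return d, g


def simulate(a, beta, sigma, N, x0, seed=0):
    """Simulate X(n+1) = a X(n) + U(n) + V(n) with U(n) = g(N - n) X(n)."""
    _, g = lqg_gains(a, beta, N)
    rng = np.random.default_rng(seed)
    x = np.empty(N + 2)
    x[0] = x0
    for n in range(N + 1):
        u = g[N - n] * x[n]          # the gain depends on the time-to-go N - n
        x[n + 1] = a * x[n] + u + sigma * rng.standard_normal()
    return x


d, g = lqg_gains(a=0.8, beta=1.0, N=100)
path = simulate(a=0.8, beta=1.0, sigma=0.5, N=100, x0=5.0)
```

Note that `sigma` appears only in `simulate`: the gains are computed without any reference to the noise variance, as the theorem states.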
Proof Let $V_m(x)$ be the minimum value of (14.2) when $N$ is replaced by $m$. The
stochastic dynamic programming equations are

$$V_m(x) = \min_u \left\{ x^2 + \beta u^2 + E[V_{m-1}(ax + u + V)] \right\}, \quad m \geq 0, \tag{14.5}$$

where $V =_D \mathcal{N}(0, \sigma^2)$. Also, $V_{-1}(x) := 0$.
We claim that the solution of these equations is

$$V_m(x) = c(m) + d(m) x^2$$

for some constants $c(m)$ and $d(m)$ where $d(m)$ satisfies (14.4).
That is, we claim that

$$\min_u \left\{ x^2 + \beta u^2 + E[c(m - 1) + d(m - 1)(ax + u + V)^2] \right\} = c(m) + d(m) x^2, \tag{14.6}$$

where $d(m)$ is given by (14.4) and the minimizer is $u = g(m)x$ where $g(m)$ is given
by (14.3).
The verification is a simple algebraic exercise that we leave to the reader. □
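For readers who want to check the claim, the algebra can be sketched as follows. The recursion for $c(m)$ in the last line is not stated in the text; it is simply what the computation produces.

```latex
\begin{align*}
E\big[(ax + u + V)^2\big] &= (ax + u)^2 + \sigma^2
  && \text{since } E(V) = 0 \text{ and } E(V^2) = \sigma^2, \\
0 &= \beta u + d(m-1)(ax + u)
  && \text{(derivative with respect to } u \text{ set to zero)}, \\
u &= -\frac{a\, d(m-1)}{\beta + d(m-1)}\, x = g(m)\, x, \\
\beta u^2 + d(m-1)(ax + u)^2 &= \frac{a^2 \beta\, d(m-1)}{\beta + d(m-1)}\, x^2
  && \text{using } ax + u = \frac{a \beta}{\beta + d(m-1)}\, x, \\
d(m) &= 1 + \frac{a^2 \beta\, d(m-1)}{\beta + d(m-1)},
  \qquad c(m) = c(m-1) + d(m-1)\, \sigma^2 .
\end{align*}
```

The noise variance $\sigma^2$ enters only the constant $c(m)$, which is why neither $d(m)$ nor the gain $g(m)$ depends on it.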


14.1.1 Letting N → ∞

What happens if $N$ becomes very large in (14.2)? Proceeding formally, we examine
(14.4) and observe that if $|a| < 1$, then $d(m) \to d$ as $m \to \infty$ where $d$ is the
solution of the fixed-point equation

$$d = f(d) := 1 + \frac{a^2 \beta d}{\beta + d}.$$

To see why this is the case, note that

$$f'(d) = \frac{a^2 \beta^2}{(\beta + d)^2},$$

so that $0 < f'(d) < a^2$ for $d > 0$. Also, $f(d) > 0$ for $d > 0$. Hence, $f(d)$ is a
contraction. That is,

$$|f(d_1) - f(d_2)| \leq \alpha |d_1 - d_2|, \quad \forall d_1, d_2 \geq 0$$

for some $\alpha \in (0, 1)$. (Here, $\alpha = a^2$.) In particular, choosing $d_1 = d$ and $d_2 = d(m)$,
we find that

$$|d - d(m + 1)| \leq \alpha |d - d(m)|, \quad \forall m \geq 0.$$

Thus,

$$|d - d(m)| \leq \alpha^m |d - d(0)|,$$

which shows that $d(m) \to d$, as claimed. Consequently, (14.3) shows that $g(m) \to g$
as $m \to \infty$, where

$$g = -\frac{a d}{\beta + d}.$$

Thus, when the time-to-go $m$ is very large, the optimal control approaches
$U(N - m) = gX(N - m)$. This suggests that this control may minimize the cost (14.2)
when $N$ tends to infinity (Fig. 14.2).

Fig. 14.2 The optimal control for the average cost (block diagram: $X(n)$ fed back through the constant gain $g$)
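The convergence of $d(m)$ can be checked numerically. The sketch below iterates $d \mapsto 1 + a^2 \beta d/(\beta + d)$ and compares the limit with the positive root of the quadratic obtained by clearing the denominator in the fixed-point equation. The function names and the values $a = 0.8$, $\beta = 1$ are illustrative assumptions.

```python
import math


def f(d, a, beta):
    """Right-hand side of the fixed-point equation d = 1 + a^2 beta d / (beta + d)."""
    return 1.0 + a**2 * beta * d / (beta + d)


def limiting_gain(a, beta, n_iter=200):
    """Iterate d <- f(d) from d(-1) = 0 and return (d, g) with g = -a d / (beta + d)."""
    d = 0.0
    for _ in range(n_iter):
        d = f(d, a, beta)
    return d, -a * d / (beta + d)


a, beta = 0.8, 1.0
d_lim, g_lim = limiting_gain(a, beta)

# Multiplying d = f(d) by (beta + d) gives d^2 + (beta - 1 - a^2 beta) d - beta = 0.
b = beta - 1.0 - a**2 * beta
d_root = (-b + math.sqrt(b**2 + 4.0 * beta)) / 2.0
```

Because the map is a contraction with factor at most $a^2$, a few dozen iterations already agree with the closed-form root to machine precision.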

The formal way to study this problem is to consider the long-term average cost
defined by

$$\lim_{N \to \infty} \frac{1}{N} E\left[\sum_{n=0}^{N} \left(X(n)^2 + \beta U(n)^2\right) \Big|\, X(0) = x\right].$$

This expression is the average cost per unit time. One can show that if $|a| < 1$, then
the control $U(n) = gX(n)$ with $g$ defined as before indeed minimizes that average
cost.


14.2 LQG with Noisy Observations 


In the previous section, we controlled a linear system with Gaussian noise assuming
that we observed the state. We now consider the case of noisy observations.
The system is

$$X(n + 1) = aX(n) + U(n) + V(n), \quad n \geq 0; \tag{14.7}$$
$$Y(n) = X(n) + W(n), \tag{14.8}$$

where the random variables $W(n)$ are i.i.d. $\mathcal{N}(0, w^2)$ and are independent of the
$V(n)$.

The problem is to find, for each $n$, the value of $U(n)$ based on the values of
$Y^n := \{Y(0), \ldots, Y(n)\}$ that minimizes the expected total cost (14.2).
The following result gives the solution of the problem (Fig. 14.3).

Fig. 14.3 The optimal control is linear in the estimate of the state


Theorem 14.2 (Optimal LQG Control with Noisy Observations) The solution of the
problem is

$$U(n) = g(N - n) \hat{X}(n),$$

where

$$\hat{X}(n) = E[X(n) \mid Y(0), \ldots, Y(n), U(0), \ldots, U(n - 1)]$$

can be computed by using the Kalman filter and the constants $g(m)$ are given by
(14.3)-(14.4).

Thus, the control values are the same as when $X(n)$ is observed exactly, except
that $X(n)$ is replaced by $\hat{X}(n)$. This feature is called certainty equivalence.


Proof The fact that the values of $g(n)$ do not depend on the noise $V(n)$ gives us
some inkling as to why the result in the theorem can be expected: given $Y^n$, the
state $X(n)$ is $\mathcal{N}(\hat{X}(n), v^2)$ for some variance $v^2$. Thus, we can view the noisy
observation as increasing the variance of the state, as if the variance of $V(n)$ were
increased.

Instead of providing the complete algebra, let us sketch why the result holds.
Assume that the minimum expected cost-to-go at time $N - m + 1$ given $Y^{N-m+1}$ is

$$c(m - 1) + d(m - 1) \hat{X}(N - m + 1)^2.$$

Then, at time $N - m$, the expected cost-to-go given $Y^{N-m}$ and $U(N - m) = u$ is
the expected value of

$$X(N - m)^2 + \beta u^2 + c(m - 1) + d(m - 1) \hat{X}(N - m + 1)^2$$

given $Y^{N-m}$ and $U(N - m) = u$. Now,

$$X(N - m) = \hat{X}(N - m) + \eta,$$

where $\eta$ is a zero-mean Gaussian random variable independent of $Y^{N-m}$. Also, as we saw when
we discussed the Kalman filter,

$$\hat{X}(N - m + 1) = a\hat{X}(N - m) + u + K(N - m + 1)\left(Y(N - m + 1) - E[Y(N - m + 1) \mid Y^{N-m}]\right).$$

Moreover, we know from our study of conditional expectation of jointly Gaussian
random variables, that $Y(N - m + 1) - E[Y(N - m + 1) \mid Y^{N-m}]$ is a Gaussian
random variable that has mean zero and is independent of $Y^{N-m}$. Hence,

$$\hat{X}(N - m + 1) = a\hat{X}(N - m) + u + Z$$

for some independent zero-mean Gaussian random variable $Z$.
Thus, the expected cost-to-go at time $N - m$ is the expected value of

$$(\hat{X}(N - m) + \eta)^2 + \beta u^2 + c(m - 1) + d(m - 1)(a\hat{X}(N - m) + u + Z)^2,$$

i.e., up to the constant $E(\eta^2)$, which does not depend on $u$, of

$$\hat{X}(N - m)^2 + \beta u^2 + c(m - 1) + d(m - 1)(a\hat{X}(N - m) + u + Z)^2.$$

This expression is identical to (14.6), except that $x$ is replaced by $\hat{X}(N - m)$ and $V$
is replaced by $Z$. Since the variance of $V$ does not affect the calculations of $g(m)$
and $d(m)$, this concludes the proof. □

14.2.1 Letting N → ∞

As when $X(n)$ is observed exactly, one can show that, if $|a| < 1$, the control

$$U(n) = g\hat{X}(n)$$

minimizes the average cost per unit time. Also, in this case, we know that the
Kalman filter becomes stationary and has the form (Fig. 14.4)

$$\hat{X}(n + 1) = a\hat{X}(n) + U(n) + K[Y(n + 1) - a\hat{X}(n) - U(n)].$$
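A rough simulation of this stationary controller is sketched below. Two things are assumptions rather than material from this section: the stationary gain $K$ is obtained by iterating the standard scalar Kalman filter variance recursion (prediction variance $S = a^2 \Sigma + \sigma^2$, gain $K = S/(S + w^2)$, updated variance $\Sigma = (1 - K)S$), and the function names and numerical values are hypothetical.

```python
import numpy as np


def stationary_gains(a, beta, sigma2, w2, n_iter=200):
    """Limiting control gain g and stationary Kalman gain K (scalar case)."""
    d = 1.0
    for _ in range(n_iter):                      # fixed point of d = 1 + a^2 beta d / (beta + d)
        d = 1.0 + a**2 * beta * d / (beta + d)
    g = -a * d / (beta + d)
    Sigma = 0.0                                  # estimation error variance
    for _ in range(n_iter):                      # standard scalar Kalman recursion (assumed)
        S = a**2 * Sigma + sigma2                # prediction error variance
        K = S / (S + w2)
        Sigma = (1.0 - K) * S
    return g, K


def average_cost(a, beta, sigma2, w2, n_steps, seed=0):
    """Empirical average of X(n)^2 + beta U(n)^2 under U(n) = g * Xhat(n)."""
    g, K = stationary_gains(a, beta, sigma2, w2)
    rng = np.random.default_rng(seed)
    x, x_hat, total = 0.0, 0.0, 0.0              # X(0) = 0 is assumed known
    for _ in range(n_steps):
        u = g * x_hat
        total += x**2 + beta * u**2
        x = a * x + u + rng.normal(0.0, np.sqrt(sigma2))
        y = x + rng.normal(0.0, np.sqrt(w2))
        pred = a * x_hat + u
        x_hat = pred + K * (y - pred)            # stationary filter update
    return total / n_steps


g, K = stationary_gains(a=0.8, beta=1.0, sigma2=0.2, w2=0.1)
avg = average_cost(a=0.8, beta=1.0, sigma2=0.2, w2=0.1, n_steps=10_000)
```

The controller only ever sees `y`; the true state `x` is used solely to generate observations and to tally the cost.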
Fig. 14.4 The optimal control for the average cost with noisy observations. Here, the Kalman filter (KF) is stationary


14.3 Partially Observed MDP 


In the previous chapter, we considered a controlled Markov chain where the action
is based on the knowledge of the state. In this section, we look at problems where
the state of the Markov chain is not observed exactly. In other words, we look at
a controlled hidden Markov chain. These problems are called partially observed
Markov decision problems (POMDPs).

Instead of discussing the general version of this problem, we look at one concrete 
example to convey the basic ideas. 


14.3.1 Example: Searching for Your Keys 


The example is illustrated in Fig. 14.5. You have misplaced your keys but you
know that they are either in bag A, with probability $p$, or in bag B, otherwise.
Unfortunately, your bags are cluttered and if you spend one unit of time (say 10 s)
looking in bag A, you find your keys with probability $\alpha$ if they are there. Similarly,
the probability for bag B is $\beta$. Every time unit, you choose which bag to explore.
Your objective is to minimize the expected time until you find your keys.

The state of the system is the location A or B of your keys. However, you do
not observe that state. The key idea (excuse the pun) is to consider the conditional
probability $p_n$ that the keys are in bag A given all your observations up to time $n$. It
turns out that $p_n$ is a controlled Markov chain, as we explain shortly. Unfortunately,
the set of possible values of $p_n$ is $[0, 1]$, which is not finite, nor even countable. Let
us not get discouraged by this technical issue.

Assume that at time $n$, when the keys are in bag A with probability $p_n$, you look
in bag A for one unit of time and you do not see the keys. What is then $p_{n+1}$? We
claim that

$$p_{n+1} = \frac{p_n (1 - \alpha)}{p_n (1 - \alpha) + (1 - p_n)} =: f(A, p_n).$$

Indeed, this is the probability that the keys are in bag A and we do not see them,
divided by the probability that we do not see the keys (either when they are there or
when they are not). Of course, if we see the keys, the problem stops.

Fig. 14.5 Where to look for your keys?

Similarly, say that we look in bag B and we do not see the keys. Then

$$p_{n+1} = \frac{p_n}{p_n + (1 - p_n)(1 - \beta)} =: f(B, p_n).$$

Thus, we control $p_n$ with our actions. Let $V(p)$ be the minimum expected time
until we find the keys, given that they are in bag A with probability $p$. Then, the
DPE are

$$V(p) = 1 + \min\{(1 - p\alpha) V(f(A, p)),\ (1 - (1 - p)\beta) V(f(B, p))\}. \tag{14.9}$$

The constant 1 is the duration of the first step. The first term in the minimum is what
happens when you look in bag A. With probability $1 - p\alpha$, you do not find your
keys and you will then have to wait a minimum expected time equal to $V(f(A, p))$
to find your keys, because the probability that they are in bag A is now $f(A, p)$.
The other term corresponds to first looking in bag B.

These equations look hopeless. However, they are easy to solve in Python. One
discretizes $[0, 1]$ into $K$ intervals and one rounds off the updates $f(A, p)$ and
$f(B, p)$.

Thus, the updates are for a finite vector $V = (V(1/K), V(2/K), \ldots, V(1))$.
With this discretization, the equations (14.9) look like

$$V = \phi(V),$$

where $\phi(\cdot)$ is the right-hand side of (14.9). These are fixed-point equations. To solve
them, we initialize $V_0 = 0$ and we iterate

$$V_{t+1} = \phi(V_t), \quad t \geq 0.$$

With a bit of luck, which can be justified mathematically, this algorithm converges to
$V$, the solution of the DPE. The solution is shown in Fig. 14.6, for different values
of $\alpha$ and $\beta$. The figure also shows the optimum action as a function of $p$. The
discretization uses $K = 1000$ values in $[0, 1]$ and the iteration is performed 100
times.
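The discretized iteration can be sketched in a few lines of Python. This is not the code used for the figure: the function name `solve_keys` is hypothetical, the grid here includes $p = 0$ (the text's vector starts at $1/K$), and $0 < \alpha < 1$, $0 < \beta < 1$ is assumed so that the posterior updates are well defined at the endpoints.

```python
import numpy as np


def solve_keys(alpha, beta, K=1000, n_iter=100):
    """Iterate V_{t+1} = phi(V_t) for the DPE (14.9) on the grid p = 0, 1/K, ..., 1."""
    p = np.arange(K + 1) / K
    # Posterior probability of bag A after an unsuccessful search in A or in B.
    f_A = p * (1 - alpha) / (p * (1 - alpha) + (1 - p))
    f_B = p / (p + (1 - p) * (1 - beta))
    # Round the updated probabilities to the nearest grid point.
    i_A = np.rint(f_A * K).astype(int)
    i_B = np.rint(f_B * K).astype(int)
    V = np.zeros(K + 1)
    for _ in range(n_iter):
        V_A = 1 + (1 - p * alpha) * V[i_A]          # look in A first
        V_B = 1 + (1 - (1 - p) * beta) * V[i_B]     # look in B first
        V = np.minimum(V_A, V_B)
    return p, V, V_A <= V_B                          # True where looking in A is optimal


p, V, look_in_A = solve_keys(alpha=0.5, beta=0.3)
```

Two sanity checks are available in closed form: when $p = 1$ the keys are surely in A and the search time is geometric with mean $1/\alpha$, and when $p = 0$ the mean is $1/\beta$.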
Fig. 14.6 Numerical solution of (14.9). The plot shows V[p] and sign(VA[p] - VB[p]) as functions of p, with the regions "Look in B" and "Look in A" marked


14.4 Summary 


* LQG Control Problem with State Observations; 
* LQG Control Problem with Noisy Observations; 
* Partially Observed MDP.


14.4.1 Key Equations and Formulas


LQG problem                        Formulation                                 (14.1)-(14.2)
Solution of LQG                    $U_n = g_{N-n} X_n$                         T.14.1
Noisy observations                 $Y_n = X_n + W_n$                           (14.8)
Solution with noisy observations   $U_n = g_{N-n} \hat{X}_n$                   T.14.2
Partially observed MDP             Replace $X_n$ by $P[X_n = x \mid Y^n]$      S.14.3


14.5 References 


The texts Bertsekas (2005), Kumar and Varaiya (1986) and Goodwin and Sin (2009)
cover LQG control. The first two texts discuss POMDP.


14.6 Problems

Problem 14.1 Consider the system

$$X(n + 1) = 0.8X(n) + U(n) + V(n), \quad n \geq 0,$$

where $X(0) = 0$ and the random variables $V(n)$ are i.i.d. and $\mathcal{N}(0, 0.2)$. The $U(n)$
are control values.


(a) Simulate the system when $U(n) = 0$ for all $n \geq 0$.

(b) Implement the control given in Theorem 14.1 with $N = 100$ and simulate the
controlled system.

(c) Implement the control with the constant gain $g = \lim_{n \to \infty} g(n)$ and simulate
the system.


Problem 14.2 Consider the system

$$X(n + 1) = 0.8X(n) + U(n) + V(n), \quad n \geq 0;$$
$$Y(n) = X(n) + W(n), \quad n \geq 0,$$

where $X(0) = 0$ and the random variables $V(n)$, $W(n)$ are independent with
$V(n) =_D \mathcal{N}(0, 0.2)$ and $W(n) =_D \mathcal{N}(0, \sigma^2)$.


(a) Implement the control described in Theorem 14.2 for $\sigma^2 = 0.1$ and $\sigma^2 = 0.4$
and simulate the controlled system.

(b) Implement the limiting control with the limiting gain and the stationary Kalman
filter for $\sigma^2 = 0.1$ and $\sigma^2 = 0.4$. Simulate the system.

(c) Compare the systems with the time-varying and the limiting controls. 


Problem 14.3 There are two coins. One is fair and the other one has a probability
of “head” equal to 0.6. You cannot tell which is which by looking at the coins. At
each step $n \geq 1$, you must choose which coin to flip. The goal is to maximize the
expected number of “heads.”


(a) Formulate the problem as a POMDP.

(b) Discretize the state of the system as we did in the “searching for your keys”
example and write the SDPEs.

(c) Implement the SDPEs in Python and simulate the resulting system.


Topics: Inference, Sufficient Statistic, Infinite Markov Chains, Poisson,
Boosting, Multi-Armed Bandits, Capacity, Bounds, Martingales, SLLN


15.1 Inference 


One key concept that we explored is that of inference. The general problem of 
inference can be formulated as follows. There is a pair of random quantities (X, Y). 
One observes Y and one wants to guess X (Fig. 15.1). 

Thus, the goal is to find a function $g(\cdot)$ such that $\hat{X} := g(Y)$ is close to $X$, in a
sense to be made precise. Here are a few sample problems:


* X is the weight of a person and Y is her height;

* X = 1 if a house is on fire, X = 0 otherwise, and Y is a measurement of the CO
density at a sensor;

* $X \in \{0, 1\}^n$ is a bit string that a transmitter sends and $Y \in \Re^{[0,T]}$ is a signal that
the receiver receives;

* Y is one woman's genome and X = 1 if she develops a specific form of breast
cancer and X = 0 otherwise;

* Y is a vector of characteristics of a movie and of one person and X is the number
of stars that the person gives to the movie;

* Y is the photograph of a person's face and X = 1 if it is that of a man and X = 0
otherwise;

* X is a sentence and Y is the signal that a microphone picks up.


Fig. 15.1 The inference problem is to guess the value of X from that of Y


We explained a few different formulations of this problem in Chaps. 7 and 9: 


* Known Distribution: We know the joint distribution of (X, Y); 

* Off-Line: We observe a set of sample values of (X, Y); 

* On-Line: We observe successive values of samples of (X, Y); 

* Maximum Likelihood Estimate: We do not want to assume a distribution for 
X, only the conditional distribution of Y given X; the goal is to find the value of 
X that makes the observed Y most likely; 

* Maximum A Posteriori Estimate: We know a prior distribution for X and the 
conditional distribution of Y given X; the goal is to find the value of X that is 
most likely given Y; 

* Hypothesis Test: We do not want to assume a distribution for $X \in \{0, 1\}$, only a
conditional distribution of Y given X; the goal is to maximize the probability of
correctly deciding that X = 1 while keeping the probability that we decide that
X = 1 when in fact X = 0 below some given $\beta$;

* MMSE: Given the joint distribution of X and Y, we want to find the function
g(Y) that minimizes $E((X - g(Y))^2)$;

* LLSE: Given the joint distribution of X and Y, we want to find the linear function
a + bY that minimizes $E((X - a - bY)^2)$.


15.2 Sufficient Statistic 


A useful notion for inference problems is that of a sufficient statistic. We have not 
discussed this notion so far. It is time to do it. 


Definition 15.1 (Sufficient Statistic) We say that $h(Y)$ is a sufficient statistic for
$X$ if

$$f_{Y|X}[y \mid x] = f(h(y), x)\, g(y),$$

or, equivalently, if

$$f_{Y|h(Y),X}[y \mid s, x] = f_{Y|h(Y)}[y \mid s].$$

We leave the verification of this equivalence to the reader.

Before we discuss the meaning of this definition, let us explore some implications.
First note that if we have a prior $f_X(x)$ and we want to calculate $MAP[X \mid Y = y]$,
we have

$$\begin{aligned}
MAP[X \mid Y = y] &= \arg\max_x f_X(x) f_{Y|X}[y \mid x] \\
&= \arg\max_x f_X(x) f(h(y), x) g(y) \\
&= \arg\max_x f_X(x) f(h(y), x).
\end{aligned}$$

Consequently, the maximizer is some function of $h(y)$. Hence,

$$MAP[X \mid Y] = g(h(Y)),$$

for some function $g(\cdot)$. In words, the information in $Y$ that is useful to calculate
$MAP[X \mid Y]$ is contained in $h(Y)$.

In the same way, we see that $MLE[X \mid Y]$ is also a function of $h(Y)$.

Observe also that

$$f_{X|Y}[x \mid y] = \frac{f_X(x) f_{Y|X}[y \mid x]}{f_Y(y)} = \frac{f_X(x) f(h(y), x) g(y)}{f_Y(y)}.$$

Now,

$$f_Y(y) = \int f_X(x) f(h(y), x) g(y)\, dx = g(y) \int f_X(x) f(h(y), x)\, dx = g(y)\, \phi(h(y)),$$

where

$$\phi(h) = \int f_X(x) f(h, x)\, dx.$$

Hence,

$$f_{X|Y}[x \mid y] = \frac{f_X(x) f(h(y), x)}{\phi(h(y))}.$$

Thus, the conditional density of $X$ given $Y$ depends only on $h(Y)$. Consequently,
$E[X \mid Y] = \psi(h(Y))$ for some function $\psi(\cdot)$.


Now, consider the hypothesis testing problem when $X \in \{0, 1\}$. Note that

$$L(y) = \frac{f_{Y|X}[y \mid 1]}{f_{Y|X}[y \mid 0]} = \frac{f(h(y), 1)\, g(y)}{f(h(y), 0)\, g(y)} = \gamma(h(y)).$$

Thus, the likelihood ratio depends only on $h(y)$ and it follows that the solution of
the hypothesis testing problem is also a function of $h(Y)$.


15.2.1 Interpretation 


The definition of sufficient statistic is quite abstract. The intuitive meaning is that if
$h(Y)$ is sufficient for $X$, then $Y$ is some function of $h(Y)$ and a random variable $Z$
that is independent of $X$ and $h(Y)$. That is,

$$Y = g(h(Y), Z). \tag{15.1}$$

For instance, say that $Y = (Y_1, \ldots, Y_n)$ where the $Y_m$ are i.i.d. and Bernoulli with
parameter $X \in [0, 1]$. Let $h(Y) = Y_1 + \cdots + Y_n$. Then we can think of $Y$ as being
constructed from $h(Y)$ by selecting randomly which $h(Y)$ random variables among
$(Y_1, \ldots, Y_n)$ are equal to one. This random choice is some independent random
variable $Z$. In such a case, we see that $Y$ does not contain any information about $X$
that is not already in $h(Y)$.

To see the equivalence between this interpretation and the definition, first assume
that (15.1) holds. Then

$$P[Y \approx y \mid X = x] = P[h(Y) \approx h(y) \mid X = x]\, P(g(h(y), Z) \approx y) = f(h(y), x)\, g(y),$$

so that $h(Y)$ is sufficient for $X$. Conversely, if $h(Y)$ is sufficient for $X$, then we can
find some $Z$ such that $g(h(y), Z)$ has the density $f_{Y|h(Y)}[\cdot \mid h(y)]$.
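The Bernoulli example can be checked by brute force. The sketch below computes the conditional probability of a particular sequence given its sum, for several values of the parameter, by enumerating all binary sequences of the same length. The function name `cond_prob_given_sum` and the chosen sequence are illustrative assumptions.

```python
from itertools import product
from math import isclose


def cond_prob_given_sum(y, x):
    """P[Y = y | h(Y) = sum(y), X = x] for i.i.d. Bernoulli(x) coordinates."""
    n, s = len(y), sum(y)
    p_y = x**s * (1 - x) ** (n - s)
    # P[h(Y) = s | X = x], obtained by summing over all sequences with sum s.
    p_s = sum(
        x ** sum(z) * (1 - x) ** (n - sum(z))
        for z in product((0, 1), repeat=n)
        if sum(z) == s
    )
    return p_y / p_s


y = (1, 0, 1, 1)
probs = [cond_prob_given_sum(y, x) for x in (0.2, 0.5, 0.9)]
```

Each entry of `probs` equals $1/\binom{4}{3} = 0.25$ whatever the parameter: given the sum, the positions of the ones are uniformly distributed and carry no information about $X$.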


15.3 Infinite Markov Chains 


We studied Markov chains on a finite state space $\mathcal{X} = \{1, 2, \ldots, N\}$. Let us explore
the countably infinite case where $\mathcal{X} = \{0, 1, \ldots\}$.

One is given an initial distribution $\pi = \{\pi(x), x \in \mathcal{X}\}$, where $\pi(x) \geq 0$ and
$\sum_{x \in \mathcal{X}} \pi(x) = 1$. Also, one is given a set of nonnegative numbers
$\{P(x, y), x, y \in \mathcal{X}\}$ such that

$$\sum_{y \in \mathcal{X}} P(x, y) = 1, \quad \forall x \in \mathcal{X}.$$

The sequence $\{X(n), n \geq 0\}$ is then a Markov chain with initial distribution $\pi$
and probability transition matrix $P$ if

$$P(X(0) = x_0, X(1) = x_1, \ldots, X(n) = x_n) = \pi(x_0) P(x_0, x_1) \times \cdots \times P(x_{n-1}, x_n),$$

for all $n \geq 0$ and all $x_0, \ldots, x_n$ in $\mathcal{X}$.

Fig. 15.2 An infinite Markov chain (transitions to the right with probability p, to the left with probability q)

One defines irreducible and aperiodic as in the case of a finite Markov chain. 
Recall that if a finite Markov chain is irreducible, then it visits all its states infinitely 
often and it spends a positive fraction of time in each state. 

That may not happen when the Markov chain is infinite. To see this, consider the
following example (see Fig. 15.2). One has $\pi(0) = 1$ and $P(i, i + 1) = p$ for $i \geq 0$
and

$$P(i + 1, i) = 1 - p =: q = P(0, 0), \quad \forall i.$$

Assume that $p \in (0, 1)$. Then the Markov chain is irreducible. However, it is
intuitively clear that $X(n) \to \infty$ as $n \to \infty$ if $p > 0.5$. To see that this is indeed
the case, let $Z(n)$ be i.i.d. random variables with $P(Z(n) = 1) = p$ and
$P(Z(n) = -1) = q$. Then note that

$$X(n) = \max\{X(n - 1) + Z(n), 0\},$$

so that

$$X(n) \geq X(0) + Z(1) + \cdots + Z(n), \quad n \geq 0.$$

Also,

$$\frac{X(n)}{n} \geq \frac{X(0) + Z(1) + \cdots + Z(n)}{n} \to E(Z(n)) > 0,$$

where the convergence follows by the SLLN. This implies that $X(n) \to \infty$, as
claimed.

Thus, $X(n)$ eventually is larger than any given $N$ and remains larger. This shows
that $X(n)$ visits every state only finitely many times. We say that the states are
transient because they are visited only finitely often.

We say that a state is recurrent if it is not transient. In that case, the state is called
positive recurrent if the average time between successive visits is finite; otherwise
it is called null recurrent.

Here is the result that corresponds to Theorem 1.1.

Theorem 15.1 (Big Theorem for Infinite Markov Chains) Consider an infinite
Markov chain.


(a) If the Markov chain is irreducible, the states are either all transient, all positive
recurrent, or all null recurrent. We then say that the Markov chain is transient,
positive recurrent, or null recurrent, respectively.

(b) If the Markov chain is positive recurrent, it has a unique invariant distribution
$\pi$ and $\pi(i)$ is the long-term fraction of time that $X(n)$ is equal to $i$.

(c) If the Markov chain is positive recurrent and also aperiodic, then the distribution
$\pi_n$ of $X(n)$ converges to $\pi$.

(d) If the Markov chain is not positive recurrent, it does not have an invariant
distribution and the fraction of time that it spends in any state goes to zero.


It turns out that the Markov chain in Fig. 15.2 is null recurrent for $p = 0.5$ and
positive recurrent for $p < 0.5$. In the latter case, its invariant distribution is

$$\pi(i) = (1 - \rho)\rho^i, \quad i \geq 0, \quad \text{where } \rho := \frac{p}{q}.$$
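One way to convince oneself of this formula is to check the balance equations numerically. The sketch below evaluates $\pi(i) = (1 - \rho)\rho^i$ on a truncated range of states and verifies $\pi(j) = \sum_i \pi(i) P(i, j)$ for the chain of Fig. 15.2. The function name, the value $p = 0.3$, and the truncation level are illustrative assumptions.

```python
import numpy as np


def invariant_distribution(p, n_states):
    """pi(i) = (1 - rho) rho^i for i = 0, ..., n_states - 1, with rho = p / (1 - p)."""
    rho = p / (1 - p)
    return (1 - rho) * rho ** np.arange(n_states)


p, q = 0.3, 0.7
n = 200                      # truncation level, used only for the numerical check
pi = invariant_distribution(p, n)

# Balance equations pi(j) = sum_i pi(i) P(i, j):
#   j = 0 :  pi(0) = pi(0) q + pi(1) q
#   j >= 1:  pi(j) = pi(j - 1) p + pi(j + 1) q
boundary_rhs = pi[0] * q + pi[1] * q
interior_lhs = pi[1:-1]
interior_rhs = pi[:-2] * p + pi[2:] * q
```

The geometric tail beyond the truncation level is negligible here, so the truncated vector also sums to 1 up to rounding.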


15.3.1 Lyapunov-Foster Criterion

Here is a useful sufficient condition for positive recurrence.


Theorem 15.2 (Lyapunov-Foster) Let $X(n)$ be an irreducible Markov chain on
an infinite state space $\mathcal{X}$. Assume there exists some function $V : \mathcal{X} \to [0, \infty)$
such that

$$E[V(X(n + 1)) - V(X(n)) \mid X(n) = x] \leq -\alpha + \beta 1\{x \in A\},$$

where $A$ is a finite set, $\alpha > 0$ and $\beta > 0$.
Then the Markov chain is positive recurrent.
Such a function $V$ is said to be a Lyapunov function for the Markov chain.


The condition means that the Lyapunov function decreases by at least $\alpha$ on
average when $X(n)$ is outside some finite set $A$. The intuitive reason why this
makes the Markov chain positive recurrent is that, since the Lyapunov function
is nonnegative, it cannot decrease forever. Thus, the Markov chain must spend a positive fraction
of time inside the finite set $A$. By the big theorem, this implies that it is positive
recurrent.
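As a small worked example (not in the original text), consider the chain of Fig. 15.2 with $p < 0.5$ and try the function $V(x) = x$. The drift can be computed directly from the transition probabilities:

```latex
\begin{align*}
E[V(X(n+1)) - V(X(n)) \mid X(n) = x] &= p \cdot (+1) + q \cdot (-1) = -(q - p), && x \geq 1, \\
E[V(X(n+1)) - V(X(n)) \mid X(n) = 0] &= p \cdot (+1) + q \cdot 0 = p = -(q - p) + q .
\end{align*}
```

Hence the Lyapunov-Foster condition holds with $\alpha = q - p > 0$, $\beta = q$, and $A = \{0\}$, which is consistent with the chain being positive recurrent when $p < 0.5$.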
15.4 Poisson Process


The Poisson process is an important model in applied probability. It is a good 
approximation of the arrivals of packets at a router, of telephone calls, of new TCP 
connections, of customers at a cashier. 


15.4.1 Definition 
We start with a definition of the Poisson process. (See Fig. 15.3.) 


Definition 15.2 (Poisson Process) Let $\lambda > 0$ and $\{S_1, S_2, \ldots\}$ be i.i.d. Exp($\lambda$)
random variables. Let also $T_n = S_1 + \cdots + S_n$ for $n \ge 1$. Define

$$N_t = \max\{n \ge 1 \mid T_n \le t\}, \quad t \ge 0,$$

with $N_t = 0$ if $t < T_1$. Then, $N := \{N_t, t \ge 0\}$ is a Poisson process with rate $\lambda$.
Note that $T_n$ is the n-th jump time of $N$.
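The definition translates almost directly into a simulation. The following is a minimal sketch, not the book's code: the function names `jump_times` and `count_at`, the rate 2.0, and the horizon 10.0 are illustrative choices.

```python
import random

def jump_times(rate, horizon, rng):
    """Return the jump times T_1 < T_2 < ... that fall in [0, horizon]."""
    times = []
    t = rng.expovariate(rate)       # S_1 ~ Exp(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)  # add the next inter-jump time
    return times

def count_at(times, t):
    """N_t = number of jump times T_n with T_n <= t (0 if t < T_1)."""
    return sum(1 for s in times if s <= t)

rng = random.Random(0)
times = jump_times(rate=2.0, horizon=10.0, rng=rng)
print(count_at(times, 5.0), count_at(times, 10.0))
```

With rate 2.0, the count at time 10 should typically be near 20, although any single run fluctuates.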


15.4.2 Independent Increments 


Before exploring the properties of the Poisson process, we recall two properties of 
the exponential distribution. 


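Fig. 15.3 Poisson process: the times $S_n$ between jumps are i.i.d. and exponentially
distributed with rate $\lambda$


Theorem 15.3 (Properties of Exponential Distribution) Let $\tau$ be exponentially
distributed with rate $\lambda > 0$. That is,

$$F_\tau(t) = P(\tau \le t) = 1 - \exp\{-\lambda t\}, \quad t \ge 0.$$

In particular, the pdf of $\tau$ is $f_\tau(t) = \lambda \exp\{-\lambda t\}$ for $t \ge 0$. Also, $E(\tau) = \lambda^{-1}$ and
$\mathrm{var}(\tau) = \lambda^{-2}$.
Then,

$$P[\tau > t + s \mid \tau > s] = P(\tau > t).$$

This is the memoryless property of the exponential distribution.
Also,

$$P[\tau \le t + \epsilon \mid \tau > t] = \lambda\epsilon + o(\epsilon).$$


Proof

$$P[\tau > t + s \mid \tau > s] = \frac{P(\tau > t + s)}{P(\tau > s)} = \frac{\exp\{-\lambda(t + s)\}}{\exp\{-\lambda s\}} = \exp\{-\lambda t\} = P(\tau > t).$$

The second property follows from the first, since
$P[\tau \le t + \epsilon \mid \tau > t] = P(\tau \le \epsilon) = 1 - \exp\{-\lambda\epsilon\} = \lambda\epsilon + o(\epsilon)$. □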

The interpretation of this property is that if a lightbulb has an exponentially 
distributed lifetime, then an old bulb is exactly as good as a new one (as long as 
it is still burning). 

We use this property to show that the Poisson process is also memoryless, in a 
precise sense. 


Theorem 15.4 (Poisson Process Is Memoryless) Let $N := \{N_t, t \ge 0\}$ be a
Poisson process with rate $\lambda$. Fix $t > 0$. Given $\{N_s, s \le t\}$, the process
$\{N_{s+t} - N_t, s \ge 0\}$ is a Poisson process with rate $\lambda$.

As a consequence, the process has stationary and independent increments. That
is, for any $0 \le t_1 < t_2 < \cdots$, the increments $\{N_{t_{n+1}} - N_{t_n}, n \ge 1\}$ of the Poisson
process are independent and the distribution of $N_{t_{n+1}} - N_{t_n}$ depends only on $t_{n+1} - t_n$.


Proof Figure 15.4 illustrates that result. Given $\{N_s, s \le t\}$, the first jump time of
$\{N_{s+t} - N_t, s \ge 0\}$ is Exp($\lambda$), by the memoryless property of the exponential
distribution. The subsequent inter-jump times are i.i.d. and Exp($\lambda$). This proves the
theorem. □
Fig. 15.4 Given the past of the process up to time $t$, the future jump times are those of
a Poisson process


15.4.3 Number of Jumps 
One has the following result. 


Theorem 15.5 (The Number of Jumps Is Poisson) Let $N := \{N_t, t \ge 0\}$ be a Poisson
process with rate $\lambda$. Then $N_t$ has a Poisson distribution with mean $\lambda t$.


Proof There are a number of ways of showing this result. The standard way is as
follows. Note that

$$P(N_{t+\epsilon} = n) = P(N_t = n)(1 - \lambda\epsilon) + P(N_t = n - 1)\lambda\epsilon + o(\epsilon).$$

Hence,

$$\frac{d}{dt} P(N_t = n) = \lambda P(N_t = n - 1) - \lambda P(N_t = n).$$

Thus,

$$\frac{d}{dt} P(N_t = 0) = -\lambda P(N_t = 0).$$

Since $P(N_0 = 0) = 1$, this shows that $P(N_t = 0) = \exp\{-\lambda t\}$ for $t \ge 0$. Now,
assume that

$$P(N_t = n) = g(n, t) \exp\{-\lambda t\}, \quad n \ge 0.$$

Then, the differential equation above shows that

$$\frac{d}{dt}\left[g(n, t) \exp\{-\lambda t\}\right] = \lambda[g(n - 1, t) - g(n, t)] \exp\{-\lambda t\},$$

i.e.,

$$\frac{d}{dt} g(n, t) = \lambda g(n - 1, t).$$

This expression shows by induction that $g(n, t) = (\lambda t)^n / n!$.
A different proof makes use of the density of the jumps. Let $T_n$ be the n-th jump
of the process and $S_n = T_n - T_{n-1}$, as before. Then

$$P(T_1 \in (t_1, t_1 + dt_1), \ldots, T_n \in (t_n, t_n + dt_n), T_{n+1} > t)$$
$$= P(S_1 \in (t_1, t_1 + dt_1), \ldots, S_n \in (t_n - t_{n-1}, t_n - t_{n-1} + dt_n), S_{n+1} > t - t_n)$$
$$= \lambda \exp\{-\lambda t_1\} dt_1 \, \lambda \exp\{-\lambda(t_2 - t_1)\} dt_2 \cdots \exp\{-\lambda(t - t_n)\}$$
$$= \lambda^n dt_1 \cdots dt_n \exp\{-\lambda t\}.$$

To derive this expression, we used the fact that the $S_n$ are i.i.d. Exp($\lambda$). The
expression above shows that, given that there are $n$ jumps in $[0, t]$, they are equally
likely to be anywhere in the interval. Also,

$$P(N_t = n) = \int_S \lambda^n dt_1 \cdots dt_n \exp\{-\lambda t\},$$

where $S = \{t_1, \ldots, t_n \mid 0 < t_1 < \cdots < t_n < t\}$. Now, observe that $S$ is a fraction of
$[0, t]^n$ that corresponds to the times $t_i$ being in a particular order. There are $n!$ such
orders and, by symmetry, each order corresponds to a subset of $[0, t]^n$ of the same
size. Thus, the volume of $S$ is $t^n / n!$. We conclude that

$$P(N_t = n) = \frac{(\lambda t)^n}{n!} \exp\{-\lambda t\},$$

which proves the result. □
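The resulting formula is easy to check numerically. The sketch below evaluates the probabilities $P(N_t = n)$ and verifies that they add up to one and have mean $\lambda t$; the function name `poisson_pmf` and the values rate = 2.0 and t = 1.5 are illustrative assumptions.

```python
import math

def poisson_pmf(n, rate, t):
    """P(N_t = n) = (rate * t)^n / n! * exp(-rate * t)."""
    mean = rate * t
    return mean ** n / math.factorial(n) * math.exp(-mean)

rate, t = 2.0, 1.5
probs = [poisson_pmf(n, rate, t) for n in range(60)]
print(sum(probs))                               # close to 1
print(sum(n * p for n, p in enumerate(probs)))  # close to rate * t
```

The sum is truncated at n = 59, which is harmless here because the neglected tail is negligible for a mean of 3.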


15.5 Boosting 


You follow the advice of some investment experts when you buy stocks. Their
recommendations are often contradictory. How do you make your decisions so that,
in retrospect, you are not doing too badly compared to the best of the experts? The
intuition is that you should try to follow the leader, but randomly. To make the
situation concrete, Fig. 15.5 shows three experts (B, I, T) and the profits one would
make by following their advice on the successive days.

On a given day, you choose which expert to follow the next day. Figure 15.6 
shows your profit if you make the sequence of selections indicated by the red circles. 
Fig. 15.5 The three experts (B, I, T) and the profits of their recommended stocks.
The figure lists the profit when following the recommendation of each expert on each day:
P(B) = 2 - 3 + 2 - 4 + 1 = -2, P(I) = -1 + 2 + 1 - 2 + 0 = 0, P(T) = -2 + 6 + 3 - 5 - 3 = -1.


Fig. 15.6 A specific sequence of choices and the resulting profit and regrets.
With P(B) = -2, P(I) = 0, P(T) = -1, the sequence of expert selections shown in the
figure yields the profit F = 2 - 3 + 1 - 2 - 3 = -5. The regrets of this algorithm are
R = (-2 + 5, 0 + 5, -1 + 5) = (3, 5, 4).


In these selections, you choose to follow B the first 2 days, then I the next two
days, then T the last day. Of course, you have to choose the day before, and the
actual profit is only known the next day. The figure also shows the regrets that you
accumulate when comparing your profit to that of the three experts. Your total profit
is -5 and the profit you would have made if you had followed B all the time would
have been -2, so your regret compared to B is -2 - (-5) = 3, and similarly for
the other two experts.

The problem is to make the expert selection every day so as to minimize the worst 
regret, i.e., the regret with respect to the most successful expert. More precisely, the 
goal is to minimize the rate of growth of the worst regret. Here is the result. 


Theorem 15.6 (Minimum Regret Algorithm) Generally, the worst regret grows
like $O(\sqrt{n})$ with the number $n$ of steps. One algorithm that achieves this rate of
regret is to choose expert $E$ at step $n + 1$ with probability $\pi_{n+1}(E)$ given by

$$\pi_{n+1}(E) = A_n \exp\{\eta P_n(E)/\sqrt{n}\}, \quad \text{for } E \in \{B, I, T\},$$

where $\eta > 0$ is a constant, $A_n$ is such that these probabilities add up to one, and
$P_n(E)$ is the profit that expert $E$ makes in the first $n$ days.


Fig. 15.7 A simulation of the experts and the selection algorithm. The three experts
are random walks with drift 0.1; the horizontal axis shows the step number (up to 500).


Thus, the algorithm favors successful experts. However, the algorithm makes 
random selections. It is easy to construct examples where a deterministic algorithm 
accumulates a regret that grows like n. 

Figure 15.7 shows a simulation of three experts and of the selection algorithm
in the theorem. The experts are random walks with drift 0.1. The simulation shows
that the selection algorithm tends to fall behind the best expert by $O(\sqrt{n})$.

The proof of the theorem can be found in Cesa-Bianchi and Lugosi (2006). 
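To make the regret computation and the selection rule concrete, here is a small sketch that uses the daily profits of Fig. 15.5. The helper names `regrets` and `selection_probabilities` and the value eta = 1.0 are assumptions made for illustration; the theorem only requires some constant $\eta > 0$.

```python
import math

# Daily profits read off Fig. 15.5 (one list per expert).
PROFITS = {
    "B": [2, -3, 2, -4, 1],
    "I": [-1, 2, 1, -2, 0],
    "T": [-2, 6, 3, -5, -3],
}

def regrets(profits, choices):
    """Regret with respect to each expert for a given sequence of choices."""
    own = sum(profits[e][day] for day, e in enumerate(choices))
    return {e: sum(p) - own for e, p in profits.items()}

def selection_probabilities(profits, n, eta):
    """pi_{n+1}(E) proportional to exp(eta * P_n(E) / sqrt(n)), for n >= 1."""
    weights = {e: math.exp(eta * sum(p[:n]) / math.sqrt(n))
               for e, p in profits.items()}
    total = sum(weights.values())  # equals 1 / A_n
    return {e: w / total for e, w in weights.items()}

print(regrets(PROFITS, ["B", "B", "I", "I", "T"]))   # {'B': 3, 'I': 5, 'T': 4}
print(selection_probabilities(PROFITS, n=2, eta=1.0))
```

After two days the cumulative profits are -1, 1, and 4 for B, I, and T, so the rule puts the largest probability on T while still leaving some probability on the other two experts.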


15.6 Multi-Armed Bandits 


Here is a classical problem. You are given two coins, both with an unknown bias
(the probability of heads). At each step $k = 1, 2, \ldots$ you choose a coin to flip.
Your goal is to accumulate heads as fast as possible. Let $X_k$ be the number of heads
you accumulate after $k$ steps. Let also $X_k^*$ be the number of heads that you would
accumulate if you always flipped the coin with the largest bias. The regret of your
strategy after $k$ steps is defined as

$$R_k = E(X_k^* - X_k).$$

Let $\theta_1$ and $\theta_2$ be the bias of coins 1 and 2, respectively. Then $E(X_k^*) = k \max\{\theta_1, \theta_2\}$
and the best strategy is to flip the coin with the largest bias at each step. However,
since the two biases are unknown, you cannot use that strategy. We explain below
that there is a strategy such that the regret grows like $\log(k)$ with the number of
steps.

Any good strategy keeps on estimating the biases. Indeed, any strategy that stops 
estimating and then forever flips the coin that is believed to be best has a positive 
probability of getting stuck with the worst coin, thus accumulating a regret that 
grows linearly over time. Thus, a good strategy must constantly explore, i.e., flip 
both coins to learn their bias. 

However, a good strategy should exploit the estimates by flipping the coin that is 
believed to be better more frequently than the other. Indeed, if you were to flip the 
two coins the same fraction of time, the regret would also grow linearly. Hence, a 
good strategy must exploit the accumulated knowledge about the biases. 

The key question is how to balance exploration and exploitation. The strategy
called Thompson Sampling does this optimally. Assume that the biases $\theta_1$ and
$\theta_2$ of the two coins are independent and uniformly distributed in $[0, 1]$. Say that
you have flipped the coins a number of times. Given the outcomes of these coin
flips, one can in principle compute the conditional distributions of $\theta_1$ and $\theta_2$. Given
these conditional distributions, one can calculate the probability that $\theta_1 > \theta_2$. The
Thompson Sampling strategy is to choose coin 1 with that probability and coin 2
otherwise for the next flip. Here is the key result.


Theorem 15.7 (Minimum Regret of Thompson Sampling) If the coins have
different biases, then any strategy is such that

$$R_k \ge O(\log k).$$

Moreover, Thompson Sampling achieves this lower bound.


The notation $O(\log k)$ indicates a function $g(k)$ that grows like $\log k$, i.e., such
that $g(k)/\log k$ converges to a positive constant as $k \to \infty$.

Thus this strategy does not necessarily choose the coin with the largest expected 
bias. It is the case that the strategy favors the coin that has been more successful so 
far, thus exploiting the information. But the selection is random, which contributes 
to the exploration. 

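One can show that if flips of coin 1 have produced $h$ heads and $t$ tails, then the
conditional density of $\theta_1$ is $g(\theta; h, t)$, where

$$g(\theta; h, t) = \frac{(h + t + 1)!}{h! \, t!} \theta^h (1 - \theta)^t, \quad \theta \in [0, 1].$$

The same result holds for coin 2. Thus, Thompson Sampling generates $\hat{\theta}_1$ and $\hat{\theta}_2$
according to these densities and flips coin 1 if $\hat{\theta}_1 > \hat{\theta}_2$, and coin 2 otherwise.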

For a proof of this result, see Agrawal and Goyal (2012). See also Russo et al. 
(2018) for applications of multi-armed bandits. 
A rough justification of the result goes as follows. Say that $\theta_1 > \theta_2$. One can
show that after flipping coin 2 a number $n$ of times, it takes about $n$ steps until you
flip it again when using Thompson Sampling. Your regret then grows by one at times
$1, 1 + 1, 2 + 2, 4 + 4, \ldots, 2^n, 2^{n+1}, \ldots$. Thus, the regret is of order $n$ after $O(2^n)$
steps. Equivalently, after $N = 2^n$ steps, the regret is of order $n = \log N$.
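The strategy described in this section can be sketched in a few lines. Sampling one value from each conditional density and flipping the coin with the larger sample selects coin 1 with exactly the conditional probability that $\theta_1 > \theta_2$. The function name `thompson_sampling`, the biases (0.7, 0.4), and the horizon of 2000 steps are illustrative assumptions; the conditional density of a bias after h heads and t tails is the Beta(h + 1, t + 1) density under the uniform prior.

```python
import random

def thompson_sampling(biases, steps, rng):
    """Flip two coins with the given (hidden) biases for `steps` rounds.

    Returns the number of times each coin was flipped.
    """
    heads = [0, 0]
    tails = [0, 0]
    pulls = [0, 0]
    for _ in range(steps):
        # Sample each bias from its conditional density Beta(h + 1, t + 1),
        # which is proportional to theta^h (1 - theta)^t.
        samples = [rng.betavariate(heads[i] + 1, tails[i] + 1) for i in range(2)]
        coin = 0 if samples[0] > samples[1] else 1
        pulls[coin] += 1
        if rng.random() < biases[coin]:
            heads[coin] += 1
        else:
            tails[coin] += 1
    return pulls

pulls = thompson_sampling(biases=(0.7, 0.4), steps=2000, rng=random.Random(0))
print(pulls)  # the coin with bias 0.7 should receive most of the flips
```

The worse coin keeps being flipped occasionally, which is the exploration, but less and less often as evidence accumulates.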


15.7 Capacity of BSC 


Consider a binary symmetric channel with error probability $p \in (0, 0.5)$. Every bit
that the transmitter sends has a chance of being corrupted. Thus, it is impossible
to transmit any bit string fully reliably across this channel. No matter what the
transmitter sends, the receiver can never be sure that it got the message right.

However, one might be able to achieve a very small probability of error. For
instance, say that $p = 0.1$ and that one transmits a bit by repeating it $N$ times,
where $N \gg 1$. As the receiver gets the $N$ bits, it uses majority decoding. That is,
if it gets more zeros than ones, it decides that the transmitter sent a zero, and conversely
for a one. The probability of error can be made arbitrarily small by choosing $N$ very
large. However, this scheme gets to transmit only one bit every $N$ steps. We say
that the rate of the channel is $1/N$ and it seems that to achieve a very small error
probability, the rate has to become negligible.
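The trade-off of the repetition scheme can be quantified exactly. The sketch below computes the probability that majority decoding fails for an odd number N of repetitions; the function name `majority_error` and the particular values of N are illustrative.

```python
import math

def majority_error(p, n):
    """P(majority decoding fails) for an n-fold repetition code, n odd.

    Decoding fails when more than half of the n copies are flipped.
    """
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

for n in (1, 3, 11, 51):
    print(n, 1 / n, majority_error(0.1, n))
```

Each printed row shows N, the rate 1/N, and the error probability: the error shrinks quickly, but only because the rate shrinks as well.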

It turns out that our pessimistic conclusion is wrong. Claude Shannon (Fig. 15.8),
in the late 1940s, explained that the channel can transmit at any rate less than $C(p)$,
where (see Fig. 15.9)

$$C(p) = 1 - H(p) \text{ with } H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad (15.2)$$

with a probability of error less than $\epsilon$, for any $\epsilon > 0$.

For instance, $C(0.1) \approx 0.53$. Fix a rate less than $C(0.1)$, say $R = 0.5$. Pick any
$\epsilon > 0$, say $\epsilon = 10^{-8}$. Then, it is possible to transmit bits across this channel at rate
$R = 0.5$, with a probability of error per bit less than $10^{-8}$. The same is true if we
choose $\epsilon = 10^{-12}$: it is possible to transmit at the same rate $R$ with a probability of
error less than $10^{-12}$. The actual scheme that we use depends on $\epsilon$, and it becomes
more complex when $\epsilon$ is smaller; however, the rate $R$ does not depend on $\epsilon$. Quite a
remarkable result! Needless to say, it baffled all the engineers who had been busily
designing various ad hoc transmission schemes.


Fig. 15.8 Claude Shannon. 1916-2001


Fig. 15.9 The capacity $C(p)$ of the BSC with error probability $p$
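Formula (15.2) is simple to evaluate. The following sketch computes the binary entropy and the capacity; the function names `binary_entropy` and `bsc_capacity` are illustrative.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """C(p) = 1 - H(p), as in (15.2)."""
    return 1.0 - binary_entropy(p)

print(round(bsc_capacity(0.1), 3))   # 0.531
```

In particular, the rate R = 0.5 used in the example is indeed below C(0.1).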

Shannon's key insight is that long sequences are typical. There is a statistical
regularity in random sequences such as Markov chains or i.i.d. random variables and
this regularity manifests itself in a characteristic of long sequences. For instance, flip
a biased coin with $P(\text{head}) = 0.1$ many times. The sequence that you will observe
is likely to have about 10% of heads. Many other sequences are so unlikely that you
will not see them. Thus, there are relatively few long sequences that are possible.
In this example, although there are $M = 2^N$ possible sequences of $N$ coin flips,
only about $\sqrt{M}$ are typical when $P(\text{head}) = 0.1$. Moreover, by symmetry, these
typical sequences are all equally likely. For that reason, the errors of the BSC must
correspond to relatively few patterns. Say that there are only $A$ possible patterns of
errors for $N$ transmissions. Then, any bit string of length $N$ that the sender transmits
will correspond to $A$ possible received "output" strings: one for every typical error
sequence. Thus, it might be possible to choose $B$ different "input" strings of length
$N$ for the transmitter so that the $A$ received "output" strings for each one of these
$B$ input strings are all distinct. However, one might worry that choosing the $B$ input
strings would be rather complex if we want their sets of output strings to be distinct.

Shannon noticed that if we pick the input strings completely randomly, this will
work. Thus, Shannon's scheme is as follows. Pick a large $N$. Choose $B$ strings of $N$
bits randomly, each time by flipping a fair coin $N$ times. Call these input strings
$X_1, \ldots, X_B$. These are the codewords. Let $S_i$ be the set of $A$ typical outputs that
correspond to $X_i$. Let $Y_j$ be the output that corresponds to input $X_j$. Note that the
$Y_j$ are sequences of fair coin flips, by symmetry of the channel. Thus, each $Y_j$
is equally likely to be any one of the $2^N$ possible output strings. In particular, for
$j \ne i$, the probability that $Y_j$ falls in $S_i$ is $A/2^N$ (Fig. 15.10).

In fact,

$$P(Y_2 \in S_1 \text{ or } Y_3 \in S_1 \ldots \text{ or } Y_B \in S_1) \le B \times A 2^{-N}.$$

Fig. 15.10 Because of the random choice of the codewords, the likelihood that
one codeword produces an output that is typical for another codeword is $A 2^{-N}$.
The figure shows the $2^N$ output strings, the output string $Y_j$ that corresponds
to input $X_j$, and the $A = 2^{NH(p)}$ typical output strings for input string $X_i$.


Indeed, the probability of a union of events is not larger than the sum of their
probabilities. We explain below that $A = 2^{NH(p)}$. Thus, if we choose $B = 2^{NR}$, we
see that the expression above is less than or equal to

$$2^{NR} \times 2^{NH(p)} \times 2^{-N}$$

and this expression goes to zero as $N$ increases, provided that

$$R + H(p) < 1, \text{ i.e., } R < C(p) := 1 - H(p).$$

Thus, the receiver makes an error with a negligible probability if one does not choose
too many codewords. Note that $B = 2^{NR}$ corresponds to transmitting $NR$ different
bits in $N$ steps, thus transmitting at rate $R$.

How does the receiver recognize the bit string that the transmitter sent? The idea 
is to give the list of the B input strings, i.e., codewords, to the receiver. When 
it receives a string, the receiver looks in the list to find the codeword that is the 
closest to the string it received. With a very high probability, it is the string that the 
transmitter sent. 

It remains to show that $A = 2^{NH(p)}$. Fortunately, this calculation is a simple
consequence of the SLLN. Let $X := \{X(n), n = 1, \ldots, N\}$ be i.i.d. random
variables with $P(X(n) = 1) = p$ and $P(X(n) = 0) = 1 - p$. For a given sequence
$x = (x(1), \ldots, x(N)) \in \{0, 1\}^N$, let

$$\psi(x) := \frac{1}{N} \log_2(P(X = x)). \qquad (15.3)$$

Note that, with $|x| := \sum_n x(n)$,

$$\psi(x) = \frac{1}{N} \log_2\left(p^{|x|}(1 - p)^{N - |x|}\right) = \frac{|x|}{N} \log_2(p) + \frac{N - |x|}{N} \log_2(1 - p).$$

Thus, the random string $X$ of $N$ bits is such that

$$\psi(X) = \frac{|X|}{N} \log_2(p) + \frac{N - |X|}{N} \log_2(1 - p).$$

But we know from the SLLN that $|X|/N \to p$ as $N \to \infty$. Thus, for $N \gg 1$,

$$\psi(X) \approx p \log_2(p) + (1 - p) \log_2(1 - p) =: -H(p).$$

This calculation shows that any sequence $x$ of values that $X$ takes has approximately
the same value of $\psi(x)$. But, by (15.3), this implies that all the sequences $x$ that occur
have approximately the same probability

$$2^{-NH(p)}.$$

We conclude that there are $2^{NH(p)}$ typical sequences and that they are all essentially
equally likely. Thus, $A = 2^{NH(p)}$.
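A quick numerical check of this count, as a sketch under a simplifying assumption: we count only the strings with exactly round(pN) ones, which is a subset of the typical set, and compare the exponent with H(p). The function name `log2_count_per_symbol` is illustrative.

```python
import math

def binary_entropy(p):
    """H(p) for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_count_per_symbol(n, p):
    """(1/N) log2 of the number of N-bit strings with exactly round(pN) ones."""
    k = round(p * n)
    return math.log2(math.comb(n, k)) / n

p = 0.1
for n in (100, 1000, 10000):
    print(n, log2_count_per_symbol(n, p), binary_entropy(p))
```

The normalized exponent approaches H(0.1) from below as N grows, which is consistent with there being about $2^{NH(p)}$ typical sequences.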

Recall that for the Gaussian channel with the MLE detection rule, the channel
becomes a BSC with

$$p = p(\sigma^2) := P(\mathcal{N}(0, \sigma^2) > 0.5).$$

Accordingly, we can calculate the capacity $C(p(\sigma^2))$ as a function of the noise
standard deviation $\sigma$. Figure 15.11 shows the result.

These results of Shannon on the capacity, or achievable rates, of channels 
have had a profound impact on the design of communication systems. Suddenly, 
engineers had a target and they knew how far or how close their systems were 
to the feasible rate. Moreover, the coding scheme of Shannon, although not really 
practical, provided a valuable insight into the design of codes for specific channels. 
Shannon's theory, called Information Theory, is an inspiring example of how a 
profound conceptual insight can revolutionize an engineering field. 


Fig. 15.11 The capacity $C(p(\sigma^2))$ of the BSC that corresponds to a $\mathcal{N}(0, \sigma^2)$ additive
noise, as a function of $\sigma$. The detector uses the MLE
Another important part of Shannon's work concerns the coding of random
objects. For instance, how many bits does it take to encode a 500-page book?
Once again, the relevant notion is that of typicality. As an example, we know
that to encode a string of $N$ flips of a biased coin with $P(\text{head}) = p$, we need
only $NH(p)$ bits, because there are only about $2^{NH(p)}$ typical sequences. Here, $H(p)$ is
called the entropy of the coin flip. Similarly, if $\{X(n), n \ge 1\}$ is an irreducible,
finite, and aperiodic Markov chain with invariant distribution $\pi$ and transition
probabilities $P(i, j)$, then one can show that to encode $\{X(1), \ldots, X(N)\}$ one needs
approximately $NH(P)$ bits, where

$$H(P) = -\sum_i \pi(i) \sum_j P(i, j) \log_2 P(i, j)$$

is called the entropy rate of the Markov chain. A practical scheme, called Lempel-Ziv
compression, essentially achieves this limit. It is the basis for most file
compression algorithms (e.g., ZIP).
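The entropy rate can be computed directly from the transition matrix. The following sketch approximates the invariant distribution by repeated multiplication, which converges for an irreducible, aperiodic, finite chain; the function names and the two-state matrix are illustrative assumptions.

```python
import math

def invariant_distribution(P, iterations=10_000):
    """Approximate pi with pi = pi P by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate(P):
    """H(P) = -sum_i pi(i) sum_j P(i, j) log2 P(i, j), with 0 log 0 = 0."""
    pi = invariant_distribution(P)
    return -sum(pi[i] * p * math.log2(p)
                for i, row in enumerate(P) for p in row if p > 0)

P = [[0.9, 0.1],
     [0.5, 0.5]]
print(entropy_rate(P))
```

For this matrix the invariant distribution is (5/6, 1/6), so the entropy rate is a weighted average of the entropies of the two rows.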

Shannon put these two ideas together: channel capacity and source coding. Here
is an example of his source-channel coding result. How fast can one send the
symbols $X(n)$ produced by the Markov chain through a BSC? The answer
is $C(p)/H(P)$ symbols per unit time. Intuitively, it takes $H(P)$ bits per symbol $X(n)$
and the BSC can send $C(p)$ bits per unit time. Moreover, to accomplish this rate, one
first encodes the source and one separately chooses the codewords for the BSC, and one then
uses them together. Thus, the channel coding is independent of the source coding
and vice versa. This is called the separation theorem of Claude Shannon.


15.8 Bounds on Probabilities


We explain how to derive estimates of probabilities using Chebyshev and Chernoff's 
inequalities and also using the Gaussian approximation. These methods also provide 
a useful insight into the likelihood of events. The power of these methods is that they 
can be used in very complex situations. 


Theorem 15.8 (Markov, Chernoff, and Jensen Inequalities) Let $X$ be a random
variable. Then one has


(a) Markov's Inequality:¹

$$P(X \ge a) \le \frac{E(f(X))}{f(a)} \qquad (15.4)$$

for all $f(\cdot)$ that is nondecreasing and positive.

(b) Chernoff's Inequality (Fig. 15.12):²

$$P(X \ge a) \le E(\exp\{\theta(X - a)\}), \qquad (15.5)$$

for all $\theta > 0$.

(c) Jensen's Inequality (Fig. 15.13):³

$$f(E(X)) \le E(f(X)), \qquad (15.6)$$

for all $f(\cdot)$ that is convex.


These results are easy to show, so here is a proof.


¹ Markov's inequality is due to Chebyshev, who was Markov's teacher.
² Chernoff's inequality is due to Herman Rubin (see Chernoff (2004)).
³ Jensen's inequality seems to be due to Jensen.


Fig. 15.12 Herman Chernoff. b. 1923


Fig. 15.13 Johan Jensen. 1859-1925
Proof


(a) Since $f(\cdot)$ is nondecreasing and positive, we have

$$1\{X \ge a\} \le \frac{f(X)}{f(a)},$$

so that (15.4) follows by taking expectations.

(b) The inequality (15.5) is a particular case of Markov's inequality (15.4) for
$f(X) = \exp\{\theta X\}$ with $\theta > 0$.

(c) Let $f(\cdot)$ be a convex function. This means that it lies above any tangent. In
particular,

$$f(X) \ge f(E(X)) + f'(E(X))(X - E(X)),$$

as shown in Fig. 15.14. The inequality (15.6) then follows by taking expectations.
□


15.8.1 Applying the Bounds to Multiplexing 


Recall the multiplexing problem. There are $N$ users who are independently active
with probability $p$. Thus, the number of active users $Z$ is $B(N, p)$. We want to find
$m$ so that $P(Z > m) = 5\%$.

As a first estimate of $m$, we use Chebyshev's inequality (2.2) which says that

$$P(|\nu - E(\nu)| > \epsilon) \le \frac{\mathrm{var}(\nu)}{\epsilon^2}.$$


Fig. 15.14 A convex function $f(\cdot)$ lies above its tangents. In particular, it lies
above a tangent at $E(X)$, which implies Jensen's inequality.


Now, if $Z = B(N, p)$, one has $E(Z) = Np$ and $\mathrm{var}(Z) = Np(1 - p)$.⁴ Hence,
since $\nu = B(100, 0.2)$, one has $E(\nu) = 20$ and $\mathrm{var}(\nu) = 16$. Chebyshev's inequality
gives

$$P(|\nu - 20| > \epsilon) \le \frac{16}{\epsilon^2}.$$

Thus, we expect that

$$P(\nu - 20 > \epsilon) \le \frac{8}{\epsilon^2},$$

because it is reasonable to think that the distribution of $\nu$ is almost symmetric around
its mean, as we see in Fig. 3.4. We want to choose $m = 20 + \epsilon$ so that $P(\nu > m) \le 5\%$.
This means that we should choose $\epsilon$ so that $8/\epsilon^2 = 5\%$. This gives $\epsilon \approx 13$, so
that $m = 33$. Thus, according to Chebyshev's inequality, it is safe to assume that no
more than 33 users are active and we can choose $C$ so that $C/33$ is a satisfactory
rate for users.
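The arithmetic of this estimate can be reproduced as follows. This is a sketch: the function names are illustrative, and the exact binomial tail is added only for comparison with the estimate.

```python
import math

def chebyshev_threshold(n, p, target):
    """Return m = mean + eps, with eps the smallest integer such that
    var / (2 * eps^2) <= target.

    The factor 1/2 is the symmetry heuristic used in the text, not a
    rigorous consequence of Chebyshev's inequality.
    """
    mean = n * p
    var = n * p * (1 - p)
    eps = math.ceil(math.sqrt(var / (2 * target)))
    return mean + eps

def binomial_tail(n, p, m):
    """Exact P(Z > m) for Z = B(n, p) and an integer m."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(m + 1, n + 1))

print(chebyshev_threshold(100, 0.2, 0.05))   # 33.0
print(binomial_tail(100, 0.2, 33))           # far below 0.05
```

The exact tail shows that the Chebyshev estimate is safe but conservative.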
As a second approach, we use Chernoff's inequality (15.5) which states that

$$P(\nu \ge Na) \le E(\exp\{\theta(\nu - Na)\}), \quad \forall \theta > 0.$$

To calculate the right-hand side, we note that if $Z = B(N, p)$, then we can
write $Z = X(1) + \cdots + X(N)$, where the $X(n)$ are i.i.d. random variables with
$P(X(n) = 1) = p$ and $P(X(n) = 0) = 1 - p$. Then,

$$E(\exp\{\theta Z\}) = E(\exp\{\theta X(1) + \cdots + \theta X(N)\}) = E(\exp\{\theta X(1)\} \times \cdots \times \exp\{\theta X(N)\}).$$

To continue the calculation, we note that, since the $X(n)$ are independent, so
are the random variables $\exp\{\theta X(n)\}$.⁵ Also, the expected value of a product
of independent random variables is the product of their expected values (see
Appendix A). Hence,

$$E(\exp\{\theta Z\}) = E(\exp\{\theta X(1)\}) \times \cdots \times E(\exp\{\theta X(N)\}) = E(\exp\{\theta X(1)\})^N = \exp\{N\Lambda(\theta)\}$$

where we define

$$\Lambda(\theta) = \log(E(\exp\{\theta X(1)\})).$$


⁴ See Appendix A.
⁵ Indeed, functions of independent random variables are independent. See Appendix A.
Fig. 15.15 The logarithm divided by $N$ of the probability of too many active users


Thus, Chernoff's inequality says that

$$P(Z \ge Na) \le \exp\{N\Lambda(\theta)\} \exp\{-\theta Na\} = \exp\{N(\Lambda(\theta) - \theta a)\}.$$

Since this inequality holds for every $\theta > 0$, let us minimize the right-hand side with
respect to $\theta$. That is, let us define

$$\Lambda^*(a) = \max_{\theta > 0}\{\theta a - \Lambda(\theta)\}.$$

Then, we see that

$$P(Z \ge Na) \le \exp\{-N\Lambda^*(a)\}. \qquad (15.7)$$

Figure 15.15 shows this function when $p = 0.2$.
We now evaluate $\Lambda(\theta)$ and $\Lambda^*(a)$. We find

$$E(\exp\{\theta X(1)\}) = 1 - p + pe^\theta,$$

so that

$$\Lambda(\theta) = \log(1 - p + pe^\theta)$$

and

$$\Lambda^*(a) = \max_{\theta > 0}\{\theta a - \log(1 - p + pe^\theta)\}.$$

Setting to zero the derivative with respect to $\theta$ of the term between brackets, we find

$$a = \frac{pe^\theta}{1 - p + pe^\theta},$$

which gives, for $a > p$,

$$e^\theta = \frac{a(1 - p)}{(1 - a)p}.$$

Substituting back in $\Lambda^*(a)$, we get

$$\Lambda^*(a) = a \log\left(\frac{a}{p}\right) + (1 - a) \log\left(\frac{1 - a}{1 - p}\right), \quad \forall a > p.$$

Going back to our example, we want to find $m = Na$ so that

$$P(\nu \ge Na) \approx 0.05.$$

Using (15.7), we need to find $Na$ so that

$$\exp\{-N\Lambda^*(a)\} \approx 0.05 = \exp\{\log(0.05)\},$$

i.e.,

$$\Lambda^*(a) = -\frac{\log(0.05)}{N} \approx 0.03.$$

Looking at Fig. 15.15, we find $a = 0.30$. This corresponds to $m = 30$. Thus,
Chernoff's estimate says that $P(\nu > 30) \approx 5\%$ and that we can size the network
assuming that only 30 users are active at any one time.
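Instead of reading the value of $a$ off the figure, one can solve the equation numerically. The sketch below uses the closed form of the rate function for a > p and a simple bisection; the function names and the tolerance are illustrative choices.

```python
import math

def rate_function(a, p):
    """Lambda*(a) = a log(a/p) + (1 - a) log((1 - a)/(1 - p)), for p < a < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def chernoff_fraction(n, p, target, tol=1e-9):
    """Solve exp(-n * Lambda*(a)) = target for a in (p, 1) by bisection."""
    goal = -math.log(target) / n
    lo, hi = p, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rate_function(mid, p) < goal:   # Lambda* is increasing on (p, 1)
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a = chernoff_fraction(100, 0.2, 0.05)
print(round(a, 2), round(100 * a))   # 0.3 30
```

The numerical solution agrees with the value a = 0.30 read from Fig. 15.15.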

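By the way, the calculations we have performed above show that Chernoff's
bound can be written as

$$P(B(N, p) \ge Na) \le \frac{P(B(N, p) = Na)}{P(B(N, a) = Na)} = \left(\frac{p}{a}\right)^{Na} \left(\frac{1 - p}{1 - a}\right)^{N(1 - a)}.$$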
15.9 Martingales


A martingale represents the sequence of fortunes of someone playing a fair game 
of chance. In such a game, the expected gain is always zero. A simple example is a 
random walk with zero-mean step size. Martingales are good models of noise and 
of processes discounted based on their expected value (e.g., the stock market). This 
theory is due to Doob (1953). 

Martingales have an important property that generalizes the strong law of large 
numbers. It says that a martingale bounded in expectation converges almost surely. 
This result is used to show that fluctuations vanish and that a process converges to its 
mean value. The convergence of stochastic gradient algorithms and approximations 
of random processes by differential equations follow from that property. 


15.9.1 Definitions 


Let $X_n$ be the fortune at time $n \ge 0$ when one plays a game of chance. The game is
fair if

$$E[X_{n+1} \mid X^n] = X_n, \quad \forall n \ge 0. \qquad (15.8)$$

In this expression, $X^n := \{X_m, m \le n\}$. Thus, in a fair game, one cannot expect
to improve one's fortune. A sequence $\{X_n, n \ge 0\}$ of random variables with that
property is a martingale.

This basic definition generalizes to the case where one has access to additional
information and is still unable to improve one's fortune. For instance, say that the
additional information is the value of other random variables $Y_n$. One then has the
following definitions.


Definition 15.3 (Martingale, Supermartingale, Submartingale) The sequence
of random variables $\{X_n, n \ge 0\}$ is a martingale with respect to $\{X_n, Y_n, n \ge 0\}$ if

$$E[X_{n+1} \mid X^n, Y^n] = X_n, \quad \forall n \ge 0 \qquad (15.9)$$

with $X^n = \{X_m, m \le n\}$ and $Y^n = \{Y_m, m \le n\}$.
If (15.9) holds with $=$ replaced by $\le$, then $X_n$ is a supermartingale; if it holds
with $\ge$, then $X_n$ is a submartingale.


In many cases, we do not specify the random variables $Y_n$ and we simply say that
$X_n$ is a martingale, or a submartingale, or a supermartingale.
Note that if $X_n$ is a martingale, then

$$E(X_n) = E(X_0), \quad \forall n \ge 0.$$

Indeed, $E(X_n) = E(E[X_n \mid X_0, Y_0])$ by the smoothing property of conditional
expectation (see Theorem 9.5), and $E[X_n \mid X_0, Y_0] = X_0$ by applying (15.9) repeatedly.


15.9.2 Examples 

A few examples illustrate the definition. 

Random Walk 

Let $\{Z_n, n \ge 0\}$ be independent and zero-mean random variables. Then
$X_n := Z_0 + \cdots + Z_n$ for $n \ge 0$ is a martingale. Indeed,

$$E[X_{n+1} \mid X^n] = E[Z_0 + \cdots + Z_n + Z_{n+1} \mid Z_0, \ldots, Z_n] = Z_0 + \cdots + Z_n = X_n.$$

Note that if $E(Z_n) \le 0$, then $X_n$ is a supermartingale; if $E(Z_n) \ge 0$, then $X_n$ is a
submartingale.


Product
Let $\{Z_n, n \ge 0\}$ be independent random variables with mean 1. Then
$X_n := Z_0 \times \cdots \times Z_n$ for $n \ge 0$ is a martingale. Indeed,

$$E[X_{n+1} \mid X^n] = E[Z_0 \times \cdots \times Z_n \times Z_{n+1} \mid Z_0, \ldots, Z_n] = Z_0 \times \cdots \times Z_n = X_n.$$

Note that if $Z_n \ge 0$ and $E(Z_n) \le 1$ for all $n$, then $X_n$ is a supermartingale. Similarly,
if $Z_n \ge 0$ and $E(Z_n) \ge 1$ for all $n$, then $X_n$ is a submartingale.


Branching Process 
For $m \ge 1$ and $n \ge 0$, let $X_m^n$ be i.i.d. random variables distributed like $X$ that take
values in $\mathbb{Z}_+ := \{0, 1, 2, \ldots\}$ and have mean $\mu$. The branching process is defined
by $Y_0 = 1$ and

$$Y_{n+1} = \sum_{m=1}^{Y_n} X_m^n, \quad n \ge 0.$$

The interpretation is that there are $Y_n$ individuals in a population at the n-th
generation. Individual $m$ in that population has $X_m^n$ children.


One can see that

$$Z_n = \mu^{-n} Y_n, \quad n \ge 0$$

is a martingale. Indeed,

$$E[Y_{n+1} \mid Y_0, \ldots, Y_n] = \mu Y_n,$$

so that

$$E[Z_{n+1} \mid Z_0, \ldots, Z_n] = E[\mu^{-(n+1)} Y_{n+1} \mid Y_0, \ldots, Y_n] = \mu^{-n} Y_n = Z_n.$$


Let $f(s) = E(s^X)$ and $q$ be the smallest nonnegative solution of $q = f(q)$.
One can then show that

$$W_n = q^{Y_n}, \quad n \ge 1$$

is a martingale.
Proof Exercise. □


Doob Martingale 
Let $\{X_n, n = 1, \ldots, N\}$ be random variables and $Y = f(X_1, \ldots, X_N)$, where $f$ is
some bounded measurable real-valued function. Then

$$Z_n = E[Y \mid X^n], \quad n = 0, \ldots, N$$

is a martingale (by the smoothing property of conditional expectation, see
Theorem 9.5) called a Doob martingale. Here are two examples.


1. Throw $N$ balls into $M$ bins, and let $Y$ be some function of the throws: the
number of empty bins, the max load, the load of the second most loaded bin, or some
similar function. Let $X_n$ be the index of the bin into which ball $n$ lands. Then
$Z_n = E[Y \mid X^n]$ is a martingale.

2. Suppose we have $r$ red and $b$ blue balls in a bin. We draw balls without
replacement from this bin: what is the number of red balls drawn? Let $X_n$ be
the indicator for whether ball $n$ is red, and let $Y = X_1 + \cdots + X_N$ be the number
of red balls drawn. Then $Z_n$ is a martingale.
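For the first example with $Y$ equal to the number of empty bins, the Doob martingale has a closed form, which makes the martingale property easy to verify exactly. The sketch below assumes the balls land uniformly and independently in the bins (an assumption not stated in the text); the function names and the values N = 6, M = 4 are illustrative.

```python
from fractions import Fraction

def doob_value(empty, n, N, M):
    """Z_n = E[number of empty bins after N balls | empty bins after n balls].

    Each currently empty bin stays empty with probability (1 - 1/M)^(N - n).
    """
    return empty * Fraction(M - 1, M) ** (N - n)

def expected_next(empty, n, N, M):
    """E[Z_{n+1} | state after n balls], averaging over where ball n+1 lands."""
    hit_empty = Fraction(empty, M)   # ball n+1 lands in an empty bin
    return (hit_empty * doob_value(empty - 1, n + 1, N, M)
            + (1 - hit_empty) * doob_value(empty, n + 1, N, M))

N, M = 6, 4
print(all(expected_next(e, n, N, M) == doob_value(e, n, N, M)
          for n in range(N) for e in range(M + 1)))   # True
```

Exact rational arithmetic is used so that the equality E[Z_{n+1} | X^n] = Z_n can be tested without rounding error.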


You Cannot Beat the House 
To study convergence, we start by explaining a key property of martingales that says 
there is no winning recipe to play a fair game of chance. 


Theorem 15.9 (You Cannot Win) Let $X_n$ be a martingale with respect to
$\{X_n, Z_n, n \ge 0\}$ and $V_n$ some bounded function of $(X^n, Z^n)$. Then

$$Y_n = \sum_{m=1}^{n} V_{m-1}(X_m - X_{m-1}), \quad n \ge 1, \qquad (15.10)$$

with $Y_0 := 0$ is a martingale.


Proof One has

$$E[Y_n - Y_{n-1} \mid X^{n-1}, Z^{n-1}] = E[V_{n-1}(X_n - X_{n-1}) \mid X^{n-1}, Z^{n-1}]$$
$$= V_{n-1} E[X_n - X_{n-1} \mid X^{n-1}, Z^{n-1}] = 0.$$
□


The meaning of $Y_n$ is the fortune that you would get by betting $V_{m-1}$ at time
$m - 1$ on the gain $X_m - X_{m-1}$ of the next round of the game. This bet must be based
on the information $(X^{m-1}, Z^{m-1})$ that you have when placing the bet, not on the
outcome of the next round, obviously. The theorem says that your fortune remains
a martingale even after adjusting your bets in real time.


Stopping Times 

When playing a game of chance, one may decide to stop after observing a particular
sequence of gains and losses. The decision to stop is non-anticipative. That is, one
cannot say "never mind, I did not mean to play the last three rounds." Thus, the
random stopping time $\tau$ must have the property that the event $\{\tau \le n\}$ must be a
function of the information available at time $n$, for all $n \ge 0$. Such a random time is
a stopping time.


Definition 15.4 (Stopping Time) A random variable $\tau$ is a stopping time for the
sequence $\{X_n, Y_n, n \ge 0\}$ if $\tau$ takes values in $\{0, 1, 2, \ldots\}$ and

$$P[\tau \le n \mid X_m, Y_m, m \ge 0] = \phi_n(X^n, Y^n), \quad \forall n \ge 0$$

for some functions $\phi_n$.


For instance,

$$\tau = \min\{n \ge 0 \mid (X_n, Y_n) \in \mathcal{A}\},$$

where $\mathcal{A}$ is a set in $\Re^2$, is a stopping time for the sequence $\{X_n, Y_n, n \ge 0\}$. Thus,
you may want to stop the first time that either you go broke or your fortune exceeds
$1000.00.

One might hope that a smart choice of when to stop playing a fair game could 
improve one's expected fortune. However, that is not the case, as the following fact 
shows. 
Theorem 15.10 (Optional Stopping) Let $\{X_n, n \ge 0\}$ be a martingale and $\tau$ a
stopping time with respect to $\{X_n, Y_n, n \ge 0\}$. Then

$$E[X_{\tau \wedge n} \mid X_0, Y_0] = X_0.$$


In the statement of the theorem, for a random time $\sigma$ one defines $X_\sigma := X_n$ when
$\sigma = n$.


Proof Note that $X_{\tau \wedge n} - X_0$ is the fortune $Y_n$ that one accumulates by betting $V_m =
1\{\tau \wedge n > m\}$ at time $m$ in (15.10), i.e., by betting 1 until one stops at time $\tau \wedge n$.
Since $1\{\tau \wedge n > m\} = 1 - 1\{\tau \wedge n \le m\} = \phi(X^m, Y^m)$, the resulting fortune is a
martingale. □


You will note that bounding $\tau \wedge n$ in the theorem above is essential. For instance,
let $X_n$ correspond to the random walk described above with $P(Z_n = 1) = P(Z_n =
-1) = 0.5$. If we define $\tau = \min\{n \ge 0 \mid X_n = 10\}$, one knows that $\tau$ is finite. (See
the comments below Theorem 15.1.) Hence, $X_\tau = 10$, so that

$$E[X_\tau \mid X_0 = 0] = 10 \ne X_0.$$

However, if we bound the stopping time, the theorem says that

$$E[X_{\tau \wedge n} \mid X_0 = 0] = 0. \tag{15.11}$$

This result deserves some thought.
One might be tempted to take the limit of the left-hand side of (15.11) as $n \to \infty$
and note that

$$\lim_{n \to \infty} X_{\tau \wedge n} = X_\tau = 10,$$

because $\tau$ is finite. One then might conclude that the left-hand side of (15.11) goes
to 10, which would contradict (15.11). However, the limit and the expectation do
not interchange because the random variables $X_{\tau \wedge n}$ are not bounded. If
they were, one would get $E[X_\tau \mid X_0] = X_0$, by the dominated convergence theorem.
We record this observation as the next result.


Theorem 15.11 (Optional Stopping-2) Let $\{X_n, n \ge 0\}$ be a martingale and $\tau$
a stopping time with respect to $\{X_n, Y_n, n \ge 0\}$. Assume that $|X_n| \le V$ for some
random variable $V$ such that $E(V) < \infty$. Then

$$E[X_\tau \mid X_0, Y_0] = X_0.$$
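
The contrast between the stopped walk and its limit can be checked exactly on a small case. The sketch below lists all paths of a fair ±1 walk of length $n$ and stops the walk when it first reaches a target level; the function names and the small level used in place of 10 (to keep the enumeration short) are assumptions made for illustration.

```python
from itertools import product
from fractions import Fraction

def stopped_value(steps, level):
    """Value X_{tau ^ n} of the walk started at 0 and frozen when it first hits `level`."""
    x = 0
    for z in steps:
        if x == level:        # tau has occurred: stop playing
            break
        x += z
    return x

def stopped_stats(n, level):
    """Exact E[X_{tau ^ n}] and P(tau <= n) over all 2^n equally likely paths."""
    values = [stopped_value(steps, level) for steps in product((-1, 1), repeat=n)]
    mean = Fraction(sum(values), len(values))
    p_hit = Fraction(sum(1 for v in values if v == level), len(values))
    return mean, p_hit
```

For every $n$ the mean is exactly zero, while the probability of having reached the level by time $n$ increases with $n$. The two facts are compatible because the paths that have not yet reached the level can be very negative, so the stopped values are not bounded.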


6 $\tau \wedge n := \min\{\tau, n\}$




$L^1$-Bounded Martingales

An $L^1$-bounded martingale cannot bounce up and down infinitely often across an
interval $[a, b]$. For if it did, you could increase your fortune without bound by
betting 1 on the way up across the interval and betting 0 on the way down. We
will see shortly that this cannot happen. As a result, the martingale must converge.
(Note that this is not true if the martingale is not $L^1$-bounded, as the random walk
example shows.)


Theorem 15.12 ($L^1$-Bounded Martingales Convergence) Let $\{X_n, n \ge 0\}$ be a
martingale such that $E(|X_n|) \le K$ for all $n$. Then $X_n$ converges almost surely to a
finite random variable $X_\infty$.


Proof Consider an interval $[a, b]$. We show that $X_n$ cannot up-cross this interval
infinitely often. (See Fig. 15.16.) Let us bet 1 on the way up and 0 on the way down.
That is, wait until $X_n$ gets first below $a$, then bet 1 at every step until $X_n > b$, then
stop betting until $X_n$ gets below $a$, and continue in this way.

If $X_m$ crossed the interval $U_n$ times by time $n$, your fortune $Y_n$ is now at least
$(b - a)U_n + (X_n - a)$. Indeed, your gain was at least $b - a$ for every upcrossing
and, in the last steps of your playing, you lose at most $a - X_n$ if $X_n$ never crosses
above $b$ after you last resumed betting. But, since $Y_n$ is a martingale, we have

$$E(Y_n) = Y_0 \ge (b - a)E(U_n) + E(X_n - a) \ge (b - a)E(U_n) - K - a.$$

(We used the fact that $X_n \ge -|X_n|$, so that $E(X_n) \ge -E(|X_n|) \ge -K$.) This shows
that $E(U_n) \le B = (K + Y_0 + a)/(b - a) < \infty$. Letting $n \to \infty$, since $U_n \uparrow U$,
where $U$ is the total number of upcrossings of the interval $[a, b]$, it follows by the
monotone convergence theorem that $E(U) \le B$. Consequently, $U$ is finite. Thus,
$X_n$ cannot up-cross any given interval $[a, b]$ infinitely often.

Consequently, the probability that it up-crosses infinitely often any interval with
rational limits is zero (since there are countably many such intervals).

This implies that $X_n$ must converge, either to $+\infty$, $-\infty$, or to a finite value.
Since $E(|X_n|) \le K$, the probability that $X_n$ converges to $+\infty$ or $-\infty$ is zero. □


Fig. 15.16 If $X_n$ does not converge, there are some rational numbers $a < b$ such
that $X_n$ crosses the interval $[a, b]$ infinitely often
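
The betting scheme used in the proof of Theorem 15.12 can be written out explicitly. The following Python sketch counts completed upcrossings of $[a, b]$ along a given path and computes the fortune obtained by betting 1 on the way up. The function names and the seeded test walk are assumptions for illustration. The check below uses the slightly more conservative bound $(b - a)U_n + \min(X_n - a, 0)$, which also covers the case where one is not betting at time $n$.

```python
import random

def count_upcrossings(path, a, b):
    """Count completed upcrossings of [a, b]: passages from below a to above b."""
    count, started = 0, False
    for x in path:
        if not started and x < a:
            started = True        # the path dipped below a
        elif started and x > b:
            started = False       # ... and then rose above b
            count += 1
    return count

def betting_fortune(path, a, b):
    """Fortune from betting 1 on each step taken after dipping below a, until exceeding b."""
    fortune, betting = 0, False
    for prev, cur in zip(path, path[1:]):
        if betting and prev > b:
            betting = False       # upcrossing completed: stop betting
        if not betting and prev < a:
            betting = True        # the path is below a: resume betting
        if betting:
            fortune += cur - prev # bet 1 on the next increment
    return fortune

def random_walk(n, seed=0):
    """Hypothetical test path: a fair +/-1 walk of n steps started at 0."""
    rng = random.Random(seed)
    path = [0]
    for _ in range(n):
        path.append(path[-1] + rng.choice((-1, 1)))
    return path
```

For any path, `betting_fortune(path, a, b)` is at least `(b - a) * count_upcrossings(path, a, b) + min(path[-1] - a, 0)`, which is the inequality that drives the bound on $E(U_n)$.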
The following is a direct but useful consequence. We used this result in the proof
of the convergence of the stochastic gradient projection algorithm (Theorem 12.2).


Theorem 15.13 ($L^2$-Bounded Martingales Convergence) Let $X_n$ be an $L^2$-bounded
martingale, i.e., such that $E(X_n^2) \le K^2$, $\forall n \ge 0$. Then $X_n \to X_\infty$, almost
surely, for some finite random variable $X_\infty$.


Proof We have

$$E(|X_n|)^2 \le E(X_n^2) \le K^2,$$

by Jensen's inequality. Thus, it follows that $E(|X_n|) \le K$ for all $n$, so that the result
of Theorem 15.12 applies to this martingale. □


One can also show that $E(|X_n - X_\infty|) \to 0$.


15.9.3 Law of Large Numbers 


The SLLN can be proved as an application of the convergence of martingales, as 
Doob (1953) showed. 


Theorem 15.14 (SLLN) Let $\{X_n, n \ge 1\}$ be i.i.d. random variables with
$E(|X_n|) = K < \infty$ and $E(X_n) = \mu$. Then

$$\frac{X_1 + \cdots + X_n}{n} \to \mu, \quad \text{almost surely as } n \to \infty.$$

Proof Let

$$S_n = X_1 + \cdots + X_n, \quad n \ge 1.$$

Note that

$$E[X_1 \mid S_n, S_{n+1}, \ldots] = \frac{1}{n} S_n =: Y_{-n}, \tag{15.12}$$

by symmetry. Thus,

$$\begin{aligned}
E[Y_{-n} \mid S_{n+1}, \ldots] &= E[E[X_1 \mid S_n, S_{n+1}, \ldots] \mid S_{n+1}, \ldots] \\
&= E[X_1 \mid S_{n+1}, \ldots] = Y_{-n-1}.
\end{aligned}$$

Thus, $\{\ldots, Y_{-n-2}, Y_{-n-1}, Y_{-n}, \ldots\}$ is a martingale. (It is a Doob martingale.) This
implies as before that the number $U_n$ of upcrossings of an interval $[a, b]$ is such that
$E(U_n) \le B < \infty$. As before, we conclude that $U := \lim U_n < \infty$, almost surely.
Hence, $Y_{-n}$ converges almost surely to a random variable $Y_{-\infty}$.


Now, since

$$Y_{-\infty} = \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n},$$

we see that $Y_{-\infty}$ is independent of $(X_1, \ldots, X_n)$ for any finite $n$. Indeed, the limit
does not depend on the values of the first $n$ random variables. However, since $Y_{-\infty}$
is a function of $\{X_n, n \ge 1\}$, it must be independent of itself, i.e., be a constant.
Since $E(Y_{-\infty}) = E(Y_{-1}) = \mu$, we see that $Y_{-\infty} = \mu$. □
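
The symmetry step (15.12) can be verified exactly on a small example. The sketch below computes $E[X_1 \mid S_n = s]$ for $n$ i.i.d. draws that are uniform on a finite set, by listing all outcomes. As a simplifying assumption it conditions on $S_n$ only, which is enough here because the later $X_m$ are independent of $(X_1, \ldots, X_n)$; the function name and the die example are illustrative.

```python
from itertools import product
from fractions import Fraction
from collections import defaultdict

def conditional_mean_of_first(values, n):
    """Exact E[X_1 | S_n = s] for n i.i.d. draws, each uniform on `values`."""
    firsts_by_sum = defaultdict(list)
    for outcome in product(values, repeat=n):
        firsts_by_sum[sum(outcome)].append(outcome[0])
    return {s: Fraction(sum(firsts), len(firsts))
            for s, firsts in firsts_by_sum.items()}

# Three rolls of a fair die: E[X_1 | S_3 = s] should equal s / 3 for every s.
die_table = conditional_mean_of_first(range(1, 7), 3)
```

Each entry of `die_table` equals `Fraction(s, 3)`, as the symmetry argument predicts.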


15.9.4 Wald’s Equality 


A useful application of martingales is the following. Let $\{X_n, n \ge 1\}$ be i.i.d.
random variables. Let $\tau$ be a random variable independent of the $X_n$'s that takes
values in $\{1, 2, \ldots\}$ with $E(\tau) < \infty$. Then

$$E(X_1 + \cdots + X_\tau) = E(\tau)E(X_1). \tag{15.13}$$

This expression is known as Wald's Equality.
To see this, note that $Y_n = X_1 + \cdots + X_n - nE(X_1)$ is a martingale. Also, $\tau$ is
a stopping time. Thus,

$$E(Y_{\tau \wedge n}) = E(Y_1) = 0,$$

which gives the identity with $\tau$ replaced by $\tau \wedge n$. If $E(\tau) < \infty$, one can let $n$ go to
infinity and get the result. (For instance, replace $X_i$ by $X_i^+$ and use MCT, similarly
for $X_i^-$, then subtract.)
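
Wald's Equality is easy to check exactly when everything is finite. The sketch below assumes the $X_i$ are uniform on a finite set and that $\tau$ has a given probability mass function on a few integers; the function name and the particular distributions are illustrative choices, not taken from the text.

```python
from itertools import product
from fractions import Fraction

def expected_random_sum(x_values, tau_pmf):
    """Exact E(X_1 + ... + X_tau): X_i i.i.d. uniform on x_values, tau independent."""
    total = Fraction(0)
    for t, p in tau_pmf.items():
        outcomes = list(product(x_values, repeat=t))
        mean_sum = Fraction(sum(sum(o) for o in outcomes), len(outcomes))
        total += p * mean_sum     # condition on tau = t, then average over tau
    return total

die = range(1, 7)                                  # E(X_1) = 7/2
tau_pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
e_tau = sum(t * p for t, p in tau_pmf.items())     # E(tau) = 7/4
lhs = expected_random_sum(die, tau_pmf)
rhs = e_tau * Fraction(7, 2)
```

Here `lhs` and `rhs` coincide, which is the statement of (15.13) for this example.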
• General inference problems: guessing X given Y, Bayesian or not;
• Sufficient statistic: h(Y) is sufficient for X;
• Infinite Markov Chains: PR, NR, T;
• Lyapunov-Foster Criterion;
• Poisson Process: independent stationary increments;
• Continuous-Time Markov Chain: rate matrix;
• Shannon Capacity of BSC: typical sequences and random codes;
• Bounds: Chernoff and Jensen;
• Martingales and Convergence;
• Strong Law of Large Numbers.


15.10.1 Key Equations and Formulas 


Topic                 Key idea                                                   Reference
Inference Problem     Guess $X$ given $Y$: MAP, MLE, HT                          S.15.1
Sufficient Statistic  $f_{Y|X}[y|x] = f(h(y), x) g(y)$                           D.15.1
Infinite MC           Irreducible $\Rightarrow$ T, NR or PR                      T.15.1
Poisson Process       Jumps w.p. $\lambda\epsilon$ in next $\epsilon$ seconds    D.15.2
Continuous-Time MC    Jumps from $i$ to $j$ w. rate $Q(i, j)$                    D.6.1
Shannon Capacity C    Can transmit reliably at any rate $R < C$                  S.15.7
  of BSC(p)           $C = 1 + p\log_2(p) + (1 - p)\log_2(1 - p)$                (15.2)
Chernoff              $P(X \ge a) \le E(\exp(\theta(X - a)))$, $\forall \theta \ge 0$   (15.5)
Jensen                $h$ convex $\Rightarrow E(h(X)) \ge h(E(X))$               (15.6)
Martingales           Zero expected increase                                     D.15.3
MG Convergence        A.s. to finite RV if $L^1$ or $L^2$ bounded                T.15.12
Wald                  $E(X_1 + \cdots + X_\tau) = E(\tau)E(X_1)$                 (15.13)

For the theory of Markov chains, see Chung (1967). The text Harchol-Balter (2013)
explains basic queueing theory and many applications to computer systems and
operations research.

The book Bremaud (1998) is also highly recommended for its clarity and the
breadth of applications. Information Theory is explained in the textbook Cover and
Thomas (1991). I learned the theory of martingales mostly from Neveu (1975). The
theory of multi-armed bandits is explained in Cesa-Bianchi and Lugosi (2006). The
text Hastie et al. (2009) is an introduction to applications of statistics in data science
(Fig. 15.17).


15.12 Problems 


Problem 15.1 Suppose that $y_1, \ldots, y_n$ are i.i.d. samples of $N(\mu, \sigma^2)$. What is a
sufficient statistic for estimating $\mu$ given $\sigma = 1$? What is a sufficient statistic for
estimating $\sigma$ given $\mu = 1$?


Problem 15.2 Customers arrive to a store according to a Poisson process with rate 
4 (per hour). 


(a) What is the probability that exactly 3 customers arrive during 1 h? 
(b) What is the probability that more than 40 min is required before the first 
customer arrives? 


Problem 15.3 Consider two independent Poisson processes with rates $\lambda_1$ and $\lambda_2$.
Those processes measure the number of customers arriving in stores 1 and 2.


(a) What is the probability that a customer arrives in store 1 before any arrives in 
store 2? 

(b) What is the probability that in the first hour exactly 6 customers arrive at the two 
stores? (The total for both is 6) 

(c) Given exactly 6 have arrived at the two stores, what is the probability all 6 went 
to store 1? 


Problem 15.4 Consider the continuous-time Markov chain in Fig. 15.17. 
(a) Find the invariant distribution. 


(b) Simulate the MC and see that the fraction of time spent in state 1 converges to
$\pi(1)$.


Fig. 15.17 CTMC
Problem 15.5 Consider a first-come-first-served discrete-time queuing system
with a single server. The arrivals are Bernoulli with rate $\lambda$. The service times are
i.i.d. and independent of the arrival times. Each service time $Z$ takes values in
$\{1, 2, \ldots, K\}$ such that $E(Z) = 1/\mu$ and $\lambda < \mu$.


(a) Construct the Markov chain that models the queue. What are the states and 
transition probabilities? [Hint: Suppose the head of the line task of the queue 
still requires z units of service. Include z in the state description of the MC.] 

(b) Use a Lyapunov-Foster argument to show that the queue is stable or, equivalently, that the
MC is positive recurrent.


Problem 15.6 Suppose that a random variable $X$ takes values in the set $\{1, 2, \ldots, K\}$
such that $\Pr(X = k) = p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$. Suppose $X_1, X_2, \ldots, X_n$ is a
sequence of $n$ i.i.d. samples of $X$.


(a) How many possible sequences exist?
(b) How many typical sequences exist when $n$ is large?
(c) Find a condition under which the answers to parts (a) and (b) are the same.


Problem 15.7 Let $\{N_t, t \ge 0\}$ be a Poisson process with rate $\lambda$. Let $S_n$ denote the
time of the $n$-th event. Find


(a) the pdf of $S_n$.

(b) $E[S_5]$.

(c) $E[S_4 \mid N(1) = 2]$.

(d) $E[N(4) - N(2) \mid N(1) = 3]$.


Problem 15.8 A queue has Poisson arrivals with rate $\lambda$. It has two servers that work
in parallel. When there are at least two customers in the queue, two are being served.
When there is only one customer, only one server is active. The service times are
i.i.d. Exp($\mu$).


(a) Argue that the queue length is a Markov Chain. 

(b) Draw the state transition diagram. 

(c) Find the minimum value of $\mu$ so that the queue is positive recurrent and solve
the balance equations.


Problem 15.9 Let $\{X_t, t \ge 0\}$ be a continuous-time Markov chain with rate matrix
$Q = \{q(i, j)\}$. Define $q(i) = \sum_{j \ne i} q(i, j)$. Let also $T_i = \inf\{t > 0 \mid X_t = i\}$ and
$S_i = \inf\{t > 0 \mid X_t \ne i\}$. Then (select the correct answers)


• $E[S_i \mid X_0 = i] = q(i)$;
• $P[T_i < T_j \mid X_0 = k] = q(k, i)/(q(k, i) + q(k, j))$ for $i, j, k$ distinct;
• If $\alpha(k) = P[T_i < T_j \mid X_0 = k]$, then $\alpha(k) = \sum_s \frac{q(k, s)}{q(k)} \alpha(s)$ for $k \notin \{i, j\}$.
Problem 15.10 A continuous-time queue has Poisson arrivals with rate $\lambda$, and it is
equipped with infinitely many servers. The servers can work in parallel on multiple
customers, but they are non-cooperative in the sense that a single customer can only
be served by one server. Thus, when there are $k$ customers in the queue, $k$ servers are
active. Suppose that the service times of the customers are i.i.d. and exponentially distributed
with rate $\mu$.


(a) Argue that the queue length is a Markov chain. Draw the transition diagram of 
the Markov chain. 

(b) Prove that for all finite values of $\lambda$ and $\mu$ the Markov chain is positive recurrent
and find the invariant distribution.


Problem 15.11 Consider a Poisson process $\{N_t, t \ge 0\}$ with rate $\lambda = 1$. Let
random variable $S_i$ denote the time of the $i$-th arrival. [Hint: You recall that

$$f_{S_i}(x) = \frac{x^{i-1} e^{-x}}{(i-1)!} 1\{x \ge 0\}.]$$

(a) Given $S_3 = s$, find the joint distribution of $S_1$ and $S_2$. Show your work.
(b) Find $E[S_2 \mid S_3 = s]$.
(c) Find $E[S_3 \mid N_1 = 2]$.


Problem 15.12 Let $S = \sum_{i=1}^{N} X_i$ denote the total amount of money withdrawn
from an ATM in 8 h, where:


(a) $X_i$ are i.i.d. random variables denoting the amount withdrawn by each customer
with $E[X_i] = 30$ and $\mathrm{Var}[X_i] = 400$.

(b) $N$ is a Poisson random variable denoting the total number of customers with
$E[N] = 80$.


Find $E[S]$ and $\mathrm{Var}[S]$.


Problem 15.13 One is given two independent Poisson processes $M_t$ and $N_t$ with
respective rates $\lambda$ and $\mu$, where $\lambda > \mu$. Find $E(\tau)$, where

$$\tau = \max\{t \ge 0 \mid M_t \le N_t + 5\}.$$

(Note that this is a max, not a min.)


Problem 15.14 Consider a queue with Poisson arrivals with rate $\lambda$. The service
times are all equal to one unit of time. Let $X_t$ be the queue length at time $t$ ($t \ge 0$).


(a) Is $X_t$ a Markov chain? Prove or disprove.

(b) Let $Y_n$ be the queue length just after the $n$-th departure from the queue ($n \ge 1$).
Prove that $Y_n$ is a Markov chain. Draw a state diagram.

(c) Prove that $Y_n$ is positive recurrent when $\lambda < 1$.
Problem 15.15 Consider a queue with Poisson arrivals with rate $\lambda$. The queue can
hold $N$ customers. The service times are i.i.d. Exp($\mu$). When a customer arrives,
you can choose to pay him $c$ so that he does not join the queue. You also pay $c$ when
a customer arrives at a full queue. You want to decide when to accept customers to
minimize the cost of rejecting them, plus the cost of the average waiting time they
spend in the queue.


(a) Formulate the problem as a Markov decision problem. For simplicity, consider
a total discounted cost. That is, if $x_t$ customers are in the system at time $t$, then
the waiting cost during $[t, t + \epsilon]$ is $e^{-\beta t} x_t \epsilon$. Similarly, if you reject a customer
at time $t$, then the cost is $c e^{-\beta t}$.

(b) Write the dynamic programming equations. 

(c) Use Python to solve the equations. 


Problem 15.16 The counting process $N := \{N_t, 0 \le t \le T\}$ is defined as follows:
Given $\tau$, $\{N_t, 0 \le t \le \tau\}$ and $\{N_t - N_\tau, \tau \le t \le T\}$ are independent Poisson
processes with respective rates $\lambda_0$ and $\lambda_1$.
Here, $\lambda_0$ and $\lambda_1$ are known and such that $0 < \lambda_0 < \lambda_1$. Also, $\tau$ is exponentially
distributed with known rate $\mu > 0$.


1. Find the MLE of $\tau$ given $N$.
2. Find the MAP of $\tau$ given $N$.


Problem 15.17 Figure 15.18 shows a system where a source alternates between the 
ON and OFF states according to a continuous-time Markov chain with the transition 
rates indicated. When the source is ON, it sends a fluid with rate 2 into the queue. 
When the source is OFF, it does not send any fluid. The queue is drained at constant 
rate 1 whenever it contains some fluid. Let $X_t$ be the amount of fluid in the queue at
time $t \ge 0$.


(a) Plot a typical trajectory of the random process $\{X_t, t \ge 0\}$.

(b) Intuitively, what are conditions on $\lambda$ and $\mu$ that should guarantee the “stability”
of the queue?

(c) Is the process $\{X_t, t \ge 0\}$ Markov?


Fig. 15.18 The system
Problem 15.18 Let $\{N_t, t \ge 0\}$ be a Poisson process with rate $\lambda$ that is exponentially
distributed with rate $\mu > 0$.


(a) Find MLE$[\lambda \mid N_s, 0 \le s \le t]$;

(b) Find MAP$[\lambda \mid N_s, 0 \le s \le t]$;

(c) What is a sufficient statistic for $\lambda$ given $\{N_s, 0 \le s \le t\}$?

(d) Instead of $\lambda$ being exponentially distributed, assume that $\lambda$ is known to take
values in $[5, 10]$. Give an estimate of the time $t$ required to estimate $\lambda$ within
5% with probability 95%.


Problem 15.19 Consider two queues in parallel in discrete time with Bernoulli
arrival processes of rates $\lambda_1$ and $\lambda_2$, and geometric service rates of $\mu_1$ and $\mu_2$,
respectively. There is only one server that can serve either queue 1 or queue
2 at each time. Consider the scheduling policy that serves queue 1 at time
$n$ if $\mu_1 Q_1(n) > \mu_2 Q_2(n)$, and serves queue 2 otherwise, where $Q_1(n)$ and
$Q_2(n)$ are the queue lengths of the queues at time $n$. Use the Lyapunov function
$V(Q_1(n), Q_2(n)) = Q_1^2(n) + Q_2^2(n)$ to show that the queues are stable if $\lambda_1/\mu_1 +
\lambda_2/\mu_2 < 1$. This scheduling policy is known as the Max-Weight or Back-Pressure
policy.
Topics: Symmetry, conditioning, independence, expectation, law of large
numbers, regression.


A.1 Symmetry 


The simplest model of probability is based on symmetry. Picture a bag with 10 
marbles that are identical, except that they are marked as shown in Fig. A.1. 

You put the marbles in a bag that you shake thoroughly and you then pick a 
marble without looking. Out of the ten marbles, seven have a blue number equal to 1. 
We say that the probability that you pick a marble whose blue number is 1 is equal to 
7/10 = 0.7. The probability is the fraction of favorable outcomes (picking a marble 
with a blue number equal to 1) among all the equally likely possible outcomes (the 
different marbles). The notion of "equally likely" is defined by symmetry, so this 
definition of probability is not circular. 

For ease of discussion, let us call B the blue number on the marble that you 
pick and R the red number on that marble. We write P(B = 1) = 0.7. Similarly, 
P(B = 2) = 0.3, P(R = 1) = 0.2, P(B = 2, R = 3) = 0.1, etc. We call R
and B random variables. The justification for the terminology is that if we were to 
repeat the experiment of shaking the bag of ten marbles and picking a marble, the 
values of R and B would vary from experiment to experiment in an unpredictable 
way. Note that R and B are functions of the same outcome (the selected marble) of 
one experiment (picking a marble). Indeed, we do not pick one marble to read the 
value of B and then another one to read the value of R; we pick only one marble 
and the values of B and R correspond to that marble. 

Fig. A.1 Ten marbles marked with a blue and a red number

Let A be a subset of the marbles. We also write P(A) for the probability that
you pick a marble that is in the set A. Since all the marbles are equally likely to
be picked, P(A) = |A|/10, where |A| is the number of marbles in the set A. For
instance, if A is the set of marbles where (B = 1, R = 3) or (B = 2, R = 4), then
P(A) = 0.4 since there are four such marbles out of 10.

It is clear that if $A_1$ and $A_2$ are disjoint sets (i.e., have no marble in common),
then $P(A_1 \cup A_2) = P(A_1) + P(A_2)$. Indeed, when $A_1$ and $A_2$ are disjoint, the
number of marbles in $A_1 \cup A_2$ is the number of marbles in $A_1$ plus the number
of marbles in $A_2$. If we divide by ten, we conclude that the probability of picking
a marble that is in $A_1 \cup A_2$ is the sum of the probabilities of picking one in $A_1$
or in $A_2$. We say that probability is additive. This property extends to any finite
collection of events that are pairwise disjoint.

Note that if $A_1$ and $A_2$ are not disjoint, then $P(A_1 \cup A_2) < P(A_1) + P(A_2)$. For
instance, if $A_1$ is the set of marbles such that B = 1 and $A_2$ is the set of marbles such
that R = 4, then $P(A_1 \cup A_2) = 0.9$, whereas $P(A_1) + P(A_2) = 0.7 + 0.5 = 1.2$.
What is happening is that $P(A_1) + P(A_2)$ is double-counting the marbles that are
in both $A_1$ and $A_2$, i.e., the marbles such that (B = 1, R = 4). We can eliminate
this double-counting and check that $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$.
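
The counting argument can be replayed in a few lines of Python. The list of marbles below is an assumption: it is reconstructed from the probabilities quoted in the text (the figure itself is not reproduced here), and the helper name `prob` is illustrative.

```python
from fractions import Fraction

# (B, R) pairs for the ten marbles, reconstructed from the probabilities quoted
# in the text (an assumption, since the figure is not available here).
MARBLES = [(1, 1)] * 2 + [(1, 3)] * 2 + [(1, 4)] * 3 + [(2, 3)] + [(2, 4)] * 2

def prob(event):
    """P(event): fraction of the equally likely marbles for which event(b, r) holds."""
    return Fraction(sum(1 for b, r in MARBLES if event(b, r)), len(MARBLES))

p_a1 = prob(lambda b, r: b == 1)                 # A1: blue number is 1
p_a2 = prob(lambda b, r: r == 4)                 # A2: red number is 4
p_union = prob(lambda b, r: b == 1 or r == 4)
p_both = prob(lambda b, r: b == 1 and r == 4)
```

With this data, `p_union` equals `p_a1 + p_a2 - p_both`, while `p_a1 + p_a2` alone exceeds it because the marbles with (B = 1, R = 4) are counted twice.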

Thus, one has to be a bit careful when examining the different ways that 
something can happen. When adding up the probabilities of these different ways, 
one should make sure that they are exclusive, i.e., that they cannot happen together. 
For example, the probability that your car is red or that it is a Toyota is not the 
sum of the probability that it is red plus the probability that it is a Toyota. This sum 
double-counts the probability that your car is a red Toyota. Such double-counting 
mistakes are surprisingly common. 


A.2 Conditioning 


Now, imagine that you pick a marble and tell me that B = 1. How do I guess R? 

Looking at the marbles, we see that there are 7 marbles with B = 1, among
which two are such that R = 1. Thus, given that B = 1, the probability that R = 1
is 2/7. Indeed, given that B = 1, you are equally likely to have picked any one of
the 7 marbles with B = 1. Since 2 out of these 7 marbles are such that R = 1, we
conclude that the probability that R = 1 given that B = 1 is 2/7.

We write P[R = 1| B = 1] = 2/7. Similarly, P[R = 3| B = 1] = 2/7 
and P[R = 4| B = 1] = 3/7. We say that P[R = 1 | B = 1] is the conditional 
probability that R = 1 given that B = 1. 

So, we are not sure of the value of R when we are told that B = 1, but the 
information is useful. For instance, you can see that P[R = 1 | B = 2] = 0, 
whereas P[R = 1 | B = 1] = 2/7. Thus, knowing B tells us something about R. 
Observe that P(R = 1, B = 1) = N(1, 1)/N, where N = 10 is the number of marbles and
N(1, 1) is the number of marbles with R = 1 and B = 1. Also, P[R = 1 | B = 1] = N(1, 1)/N(1, *),
where N(1, *) is the number of marbles with B = 1 and R taking an arbitrary value.
Moreover, P(B = 1) = N(1, *)/N. It then follows that


P(B=1,R=1) = P(B = 1) x P[R = 1 | B = 1]. 


Indeed, 


$$\frac{N(1, 1)}{N} = \frac{N(1, *)}{N} \times \frac{N(1, 1)}{N(1, *)}.$$


To make the previous identity intuitive, we argue in the following way. For (B = 
1, R = 1) to occur, B = 1 must occur and then R = 1 must occur given that B = 1. 
Thus, the probability of (B = 1, R = 1) is the probability of B = 1 times the 
probability of R = 1 given that B = 1. 

The previous identity shows that 


$$P[R = 1 \mid B = 1] = \frac{P(B = 1, R = 1)}{P(B = 1)}.$$

Intuitively, this expression says that the probability of R = 1 given B = 1 is the 
fraction of marbles with B = 1 and R = 1 among the marbles with B = 1. 
More generally, for any two values b and r of B and R, one has 


$$P[R = r \mid B = b] = \frac{P(B = b, R = r)}{P(B = b)} \tag{A.1}$$


and 


P(B =b, R =r) = P(B = b)P[R =r | B =b]. (A.2) 


The most likely value of R given that B = 1 is 4. Indeed, P[R = 4 | B = 1]
is larger than P[R = 1 | B = 1] and P[R = 3 | B = 1]. We say that 4 is the
maximum a posteriori (MAP) estimate of R given that B = 1. Similarly, the MAP
estimate of B given that R = 4 is 1. Indeed, P[B = 1 | R = 4] = 3/5, which is
larger than P[B = 2 | R = 4] = 2/5.

A slightly different concept is the maximum likelihood estimate (MLE) of B given
R = 4. By definition, this is the value of B that makes R = 4 most likely. We see
that the MLE of B given that R = 4 is 2 because P[R = 4 | B = 2] = 2/3 >
P[R = 4 | B = 1] = 3/7.
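
The difference between the two estimates is only a matter of which conditional probability is maximized. The Python sketch below makes this explicit. The marble list is an assumption reconstructed from the probabilities quoted in the text, and the function names are illustrative.

```python
from fractions import Fraction

# (B, R) pairs reconstructed from the probabilities quoted in the text (an assumption).
MARBLES = [(1, 1)] * 2 + [(1, 3)] * 2 + [(1, 4)] * 3 + [(2, 3)] + [(2, 4)] * 2

def cond_prob(event, given):
    """P[event | given], computed by counting among the marbles that satisfy `given`."""
    selected = [m for m in MARBLES if given(*m)]
    return Fraction(sum(1 for m in selected if event(*m)), len(selected))

def map_of_b(r_obs):
    """MAP estimate: the value b that maximizes P[B = b | R = r_obs]."""
    return max((1, 2), key=lambda b: cond_prob(lambda bb, rr: bb == b,
                                               lambda bb, rr: rr == r_obs))

def mle_of_b(r_obs):
    """MLE: the value b that maximizes P[R = r_obs | B = b]."""
    return max((1, 2), key=lambda b: cond_prob(lambda bb, rr: rr == r_obs,
                                               lambda bb, rr: bb == b))
```

With this data, `map_of_b(4)` is 1 and `mle_of_b(4)` is 2: the MAP weighs in the fact that B = 1 is more common, whereas the MLE ignores it.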

For instance, the MLE of a disease of a person with high fever might be Ebola,
but the MAP may be a common flu. To see this, imagine 100 marbles out of which
one is marked (Ebola, High Fever), 15 are marked (Flu, High Fever), 5 are marked
(Flu, Low Fever), and the others are marked (Something Else, No Fever). For this
model, we see that P[High Fever | Ebola] = 1 > P[High Fever | Flu] = 0.75, and
P[Flu | High Fever] = 15/16 > P[Ebola | High Fever] = 1/16.
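
The same computation can be sketched for the 100 marbles of this example. The counts come from the text; the dictionary layout and the function names are assumptions made for illustration.

```python
from fractions import Fraction

# Number of marbles carrying each (disease, symptom) mark, as described in the text.
COUNTS = {
    ("Ebola", "High Fever"): 1,
    ("Flu", "High Fever"): 15,
    ("Flu", "Low Fever"): 5,
    ("Something Else", "No Fever"): 79,
}
DISEASES = ("Ebola", "Flu", "Something Else")

def likelihood(disease, symptom):
    """P[symptom | disease]."""
    with_disease = sum(n for (d, _), n in COUNTS.items() if d == disease)
    return Fraction(COUNTS.get((disease, symptom), 0), with_disease)

def posterior(disease, symptom):
    """P[disease | symptom]."""
    with_symptom = sum(n for (_, s), n in COUNTS.items() if s == symptom)
    return Fraction(COUNTS.get((disease, symptom), 0), with_symptom)

mle = max(DISEASES, key=lambda d: likelihood(d, "High Fever"))
map_estimate = max(DISEASES, key=lambda d: posterior(d, "High Fever"))
```

Here `mle` is "Ebola" and `map_estimate` is "Flu", which matches the discussion above.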


A.3 Common Confusion 


The discussion so far probably seems quite elementary. However, most of the 
confusion about probability arises with these basic ideas. Let us look at some 
examples. 

You are told that Bill has two children and one of them is named Isabelle.
What is the probability that Bill has two daughters? You might argue that Bill's
other child has a 50% probability of being a girl, so that the probability that
Bill has two daughters must be 0.5. In fact, the correct answer is 1/3. To see
this, look at the four equally likely outcomes for the sex of the two children:
(M, M), (F, M), (M, F), (F, F) where (M, F) means that the first child is male
and the second is female, and similarly for the other cases. Out of these four
outcomes, three are consistent with the information that “one of them is named
Isabelle.” Out of these three outcomes, one corresponds to Bill having two daughters.
Hence, the probability that Bill has two daughters given that one of his two
children is named Isabelle is 1/3, not 50%.
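
The argument amounts to listing the outcomes and keeping those that are consistent with the information. A minimal Python sketch follows. It assumes, as the text does, that the four outcomes are equally likely and that the information only tells us that at least one child is a girl; the function name is illustrative.

```python
from itertools import product
from fractions import Fraction

def prob_two_daughters_given_a_girl():
    """P[(F, F) | at least one child is female] over four equally likely outcomes."""
    outcomes = list(product("MF", repeat=2))          # (first child, second child)
    consistent = [o for o in outcomes if "F" in o]    # outcomes with at least one girl
    favorable = [o for o in consistent if o == ("F", "F")]
    return Fraction(len(favorable), len(consistent))
```

The function returns 1/3 because three outcomes remain after conditioning and only one of them is (F, F).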

This example shows that confusion in Probability is not caused by the sophis- 
tication of the mathematics involved. It is not a lack of facility with Calculus 
or Algebra that causes the difficulty. It is the lack of familiarity with the basic 
formalism: looking at the possible outcomes and identifying precisely what the 
given information tells us about these outcomes. 

Another common source of confusion concerns chance fluctuations. Say that you 
flip a fair coin ten times. You expect about half of the outcomes to be tails and half 
to be heads. Now, say that the first six outcomes happen to be heads. Do you think 
the next four are more likely to be tails, to catch up with the average? Of course not. 
After 4 years of drought in California, do you expect the next year to be rainier than 
average? You should not. 

Surprisingly, many people believe in the memory of purely random events. A 
useful saying is that “lady luck has no memory nor vengeance.” 

A related concept is "regression to the mean." A simple example goes as follows. 
Flip a fair coin twenty times. Say that eight of the first ten flips are heads. You 
expect the next ten flips to be more balanced. This does not mean that the next ten 
flips are more likely to be tails to compensate for the first ten flips. It simply means 
that the abnormal fluctuations in the first ten flips do not carry over to the next ten 
flips. More subtle scenarios of this example involve the stock market or the scores of 
sports teams, but the basic idea is the same. Of course, if you do not know whether 
the coin is fair or biased, observing eight heads out of the first ten flips suggests that 
the coin is biased in favor of heads, so that the next ten coin flips are likely to give 
more heads than tails. But, if you have observed many flips of that coin in the past, 
then you may know that it is fair, and in that case regression to the mean makes 
sense. 
Now, you might ask “how can about half of the coin flips be heads if the coin
does not make up for an excessive number of previous tails?” The answer is that
among the $2^{10} = 1024$ equally likely strings of 10 heads and tails, a very large
proportion have about 50% of heads. Indeed, 672 such strings have either 4, 5, or
6 heads. Thus the probability that the fraction of heads is between 40 and 60% is
672/1,024 = 65.6%. This probability gets closer to one as you flip more coins.
For twenty coins, the probability that the fraction of heads is between 40 and 60% is
73.7%.
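
These proportions follow from binomial counts. The short Python sketch below computes them exactly; the function name is an illustrative assumption, and fractions are used for the bounds to avoid rounding issues.

```python
from math import comb
from fractions import Fraction

def prob_heads_fraction_between(n, low, high):
    """Exact P(low <= (number of heads)/n <= high) for n flips of a fair coin."""
    favorable = sum(comb(n, k) for k in range(n + 1) if low * n <= k <= high * n)
    return Fraction(favorable, 2 ** n)

p10 = prob_heads_fraction_between(10, Fraction(2, 5), Fraction(3, 5))
p20 = prob_heads_fraction_between(20, Fraction(2, 5), Fraction(3, 5))
```

Calling the function with larger values of `n` shows the probability moving toward one, as claimed in the text.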

To avoid being confused, always keep the basic formalism in mind. What are the 
outcomes? How likely are they? What does the known information tell you about 
them? 


A.4 Independence 


Look at the marbles in Fig. A.2. For these marbles, we see that P[R = 1 | B =
1] = P[R = 3 | B = 1] = 2/4 = 0.5. Also, P[R = 1 | B = 2] = P[R = 3 | B =
2] = 3/6 = 0.5. Thus, knowing the value of B does not change the probability of
the different values of R. We say that for this experiment, R and B are independent.
Here, the value of R tells us something about which marble you picked, but that
information does not change the probability of the different values of B.

In contrast, for the marbles in Fig. A.1, we saw that P[R = 1| B = 2] = 0 
and P[R = 1 | B = 1] = 2/7 so that, for that experiment, R and B are not 
independent: knowing the value of B changes the probability of R = 1. That is, B 
tells you something about R. This is rather common. The temperature in Berkeley 
tells us something about the temperature in San Francisco. If it rains in Berkeley, it 
is likely to rain in San Francisco. 

The fact that observations tell us something about what we do not observe directly
is central in applied probability. It is at the core of data science. What information
do we get from data? We explore this question later in this appendix.

Summarizing, we say that B and R are independent if P[R = r | B =
b] = P(R = r) for all values of r and b. In view of (A.2), B and R are
independent if

$$P(B = b, R = r) = P(B = b)P(R = r), \quad \text{for all } b, r. \tag{A.3}$$

Fig. A.2 Ten other marbles marked with a blue and a red number

As a simple example of independence, consider ten flips of a fair coin. Let X be
the number of heads in the first 4 flips and Y the number of heads in the last 6 flips.
We claim that X and Y are independent. Intuitively, this is obvious. However, how
do we show this formally? In this experiment, an outcome is a string of ten heads
and tails. These outcomes are all equally likely. Fix arbitrary integers x and y. Let A
be the set of outcomes for which X = x and B the set of outcomes for which Y = y.
Figure A.3 illustrates the sets A and B. In the figure, the horizontal axis corresponds
to the different strings of the first four flips; the vertical axis corresponds to the
different strings of the last six flips.

The set A corresponds to $a$ different strings of the first four flips that are such
that X = x and arbitrary values of the last six flips. Similarly, B corresponds to $b$
different strings of the last six flips and arbitrary values of the first four flips. Note
that A has $a \times 2^6$ outcomes since each of the $a$ strings corresponds to $2^6$ strings of
the last 6 flips. Similarly, B has $2^4 \times b$ outcomes. Thus,

$$P(A) = \frac{a \times 2^6}{2^{10}} = \frac{a}{2^4} \quad \text{and} \quad P(B) = \frac{2^4 \times b}{2^{10}} = \frac{b}{2^6}.$$

Moreover,

$$P(A \cap B) = \frac{a \times b}{2^{10}}.$$

Hence,

$$P(A \cap B) = P(A) \times P(B).$$

If you look back at the calculation, you will notice that it boils down to the area of a
rectangle being the product of the sides. The key observation is then that the set of
outcomes where X = x and Y = y is a rectangle. This is so because X = x imposes
a constraint on the first four flips and Y = y imposes a constraint on the other flips.
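
The rectangle argument can be confirmed by brute force. The Python sketch below lists the $2^{10}$ equally likely strings and checks the product rule for every pair of values of X and Y. The encoding (1 for heads, 0 for tails) and the function names are illustrative assumptions.

```python
from itertools import product
from fractions import Fraction

FLIPS = list(product((0, 1), repeat=10))   # 1 = heads; all 2^10 equally likely strings

def prob(event):
    """P(event): fraction of the equally likely strings for which event(w) holds."""
    return Fraction(sum(1 for w in FLIPS if event(w)), len(FLIPS))

def first4_and_last6_independent():
    """Check P(X = x, Y = y) = P(X = x) P(Y = y) for all x in 0..4 and y in 0..6."""
    for x in range(5):
        for y in range(7):
            p_a = prob(lambda w: sum(w[:4]) == x)
            p_b = prob(lambda w: sum(w[4:]) == y)
            p_ab = prob(lambda w: sum(w[:4]) == x and sum(w[4:]) == y)
            if p_ab != p_a * p_b:
                return False
    return True
```

The check succeeds for every pair (x, y), since each event {X = x, Y = y} is a rectangle in the sense described above.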
A.5 Expectation


Going back to the marbles of Fig. A.1, what do you expect the value of B to be? How 
much would you be willing to pay to pick a marble given that you get the value of B 
in dollars? An intuitive argument is that if you were to repeat the experiment 1000 
times, you should get a marble with B = 1 about 70% of the time, i.e., about 700 
times. The other 300 times, you would get B = 2. The total amount should then be
1 × 700 + 2 × 300. The average value per experiment is then

$$\frac{1 \times 700 + 2 \times 300}{1000} = 1 \times 0.7 + 2 \times 0.3 = 1.3.$$


We call this number the expected value of B and we write it E(B). Similarly, we 
define 


E(R) = 1 x 0.2 +3 x 0.3 + 4 x 0.5. 


Thus, the expected value is defined as the sum of the values multiplied by 
their probability. The interpretation we gave by considering the experiment being 
repeated a large number of times is only an interpretation, for now. 

Reviewing the argument, and extending it somewhat, let us assume that we have
$N$ marbles marked with a number $X$ that takes the possible values $\{x_1, x_2, \ldots, x_M\}$
and that a fraction $p_m$ of the marbles are marked with $X = x_m$, for $m = 1, \ldots, M$.
Then we write $P(X = x_m) = p_m$ for $m = 1, \ldots, M$. We define the expected value
of $X$ as

$$E(X) = \sum_{m=1}^{M} x_m p_m = \sum_{m=1}^{M} x_m P(X = x_m). \tag{A.4}$$


Consider a random variable X that is equal to the same constant x for every 
outcome. For instance, X could be a number on a marble when all the marbles are 
marked with the same number x. In this case, 


$$E(X) = x \times P(X = x) = x.$$


Thus, the expected value of a constant is the constant. For instance, if we designate
by a a random variable that always takes the value a, then we have


$$E(a) = a.$$


There is a slightly different but very useful way to compute the expectation. We 
can write E (X) as the sum over all the possible marbles we could pick of the product 
of X for that marble times the probability 1/N that we pick that particular marble. 
Doing this, we have

$$E(X) = \sum_{n=1}^{N} X(n) \frac{1}{N},$$

where X(n) is the value of X for marble n. This expression gives the same value as
the previous calculation. Indeed, in this sum there are $p_m N$ terms with $X(n) = x_m$
because we know that a fraction $p_m$ of the N marbles, i.e., $p_m N$ marbles, are marked
with $X = x_m$. Hence, the sum above is equal to

$$\sum_{m=1}^{M} (p_m N) x_m \frac{1}{N} = \sum_{m=1}^{M} p_m x_m,$$


which agrees with the previous expression for E(X). 
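
As a quick check that the two ways of summing agree, here is a minimal sketch. The list `marbles` is a made-up bag chosen for illustration, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical bag: the number X(n) written on each of the N marbles.
marbles = [1, 1, 3, 3, 4, 4, 4, 3, 4, 4]
N = len(marbles)

# Sum over values: E(X) = sum_m x_m p_m, where p_m is the fraction of marbles marked x_m.
counts = Counter(marbles)
by_value = sum(Fraction(x * c, N) for x, c in counts.items())

# Sum over marbles: E(X) = sum_n X(n) (1/N).
by_marble = sum(Fraction(x, N) for x in marbles)

print(by_value, by_marble)  # the two sums give the same fraction
```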

This latter calculation is useful to show that E(B + R) = E(B) + E(R) in 
our example of Fig. A.1 with ten marbles. Let us examine this important property 
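closely. You pick a marble and you get B + R. By looking at the marbles, you see
that you get 1 + 1 if you pick the first marble, and so on. Thus, writing B(n) and R(n)
for the values of B and R on marble n,

$$E(B + R) = (B(1) + R(1)) \frac{1}{10} + \cdots + (B(10) + R(10)) \frac{1}{10}.$$

If we decompose this sum by regrouping the values of B and then those of R, we
see that

$$E(B + R) = \left[ B(1) \frac{1}{10} + \cdots + B(10) \frac{1}{10} \right] + \left[ R(1) \frac{1}{10} + \cdots + R(10) \frac{1}{10} \right].$$
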
The first sum is E(B) and the second is E(R). Thus, the expected value of a sum is 
the sum of the expected values. We say that expectation is linear. Notice that this is 
so even though the values B and R are not independent. 
More generally, for our N marbles, if marble n is marked with two numbers X(n)
and Y(n), then we see that

$$E(X + Y) = \sum_{n=1}^{N} (X(n) + Y(n)) \frac{1}{N} = \sum_{n=1}^{N} X(n) \frac{1}{N} + \sum_{n=1}^{N} Y(n) \frac{1}{N} = E(X) + E(Y). \qquad \text{(A.5)}$$

Linearity shows that if we get $5 + 3X^2 + 4Y^3$ when we pick a marble marked
with the numbers X and Y, then

$$E(5 + 3X^2 + 4Y^3) = 5 + 3E(X^2) + 4E(Y^3).$$

Indeed,

$$E(5 + 3X^2 + 4Y^3) = \frac{1}{N} \sum_{n=1}^{N} (5 + 3X^2(n) + 4Y^3(n)) = 5 + 3E(X^2) + 4E(Y^3).$$


As another example, we have

$$E(a + X) = E(a) + E(X) = a + E(X).$$

Similarly,

$$E((X - a)^2) = E(X^2 - 2aX + a^2) = E(X^2) - 2aE(X) + a^2.$$

Choosing a = E(X) in the previous example, we find

$$E((X - E(X))^2) = E(X^2) - 2E(X)E(X) + [E(X)]^2 = E(X^2) - [E(X)]^2,$$

an example we discuss in the next section.

There is another elementary property of expectation that we use in the next
section. Consider marbles where B ≤ R, such as the marbles in Fig. A.1. Compare
the sum over the marbles of B(n) x (1/10) with the sum of R(n) x (1/10). Term by
term, the first sum is less than or equal to the second, so that E(B) ≤ E(R). Hence,
if B ≤ R for every outcome, one has E(B) ≤ E(R). We say that expectation is monotone.

We will use yet another simple property of expectation in the next section. 
(Do not worry, there are not many such properties!) Assume that X and Y are 
independent. Recall that this means that $P(X = x_i, Y = y_j) = P(X = x_i)P(Y = y_j)$
for all possible pairs of values $(x_i, y_j)$ of X and Y. Then we claim that


E(XY) = E(X)E(Y). (A.6) 


To see this, we write

$$E(XY) = \sum_{n=1}^{N} X(n)Y(n) \frac{1}{N} = \sum_{i} \sum_{j} x_i y_j \frac{N(i, j)}{N},$$

where N(i, j) is the number of marbles marked with $(x_i, y_j)$. We obtained the
last term by regrouping the terms based on the values of X(n) and Y(n). Now,
$N(i, j)/N = P(X = x_i, Y = y_j)$. Also, by independence, $P(X = x_i, Y = y_j) =
P(X = x_i)P(Y = y_j)$. Thus, we can write the sum above as follows:

$$E(XY) = \sum_{i} \sum_{j} x_i y_j P(X = x_i, Y = y_j) = \sum_{i} \sum_{j} x_i y_j P(X = x_i)P(Y = y_j).$$

We now compute the sum on the right by first summing over j. We get

$$E(XY) = \sum_{i} \left[ \sum_{j} x_i y_j P(X = x_i)P(Y = y_j) \right] = \sum_{i} x_i P(X = x_i) \left[ \sum_{j} y_j P(Y = y_j) \right].$$

We got the last expression by noticing that the factor $x_i P(X = x_i)$ is common
to all the terms $x_i y_j P(X = x_i)P(Y = y_j)$ for different values of j when i is fixed.
Now, the term between brackets is E(Y). Hence, we find

$$E(XY) = \sum_{i} x_i P(X = x_i) E(Y) = E(X)E(Y),$$


as claimed. Thus, if X and Y are independent, the expected value of their product is 
the product of their expected values. 

We can check that this property does not generally hold if the random variables 
are not independent. For instance, consider R and B in Fig. A.1. We find that 


$$E(BR) = (1 + 1 + 3 + 3 + 4 + 4 + 4 + 6 + 8 + 8)/10 = 4.2,$$

whereas

$$E(B) = 1.3 \text{ and } E(R) = 3.1,$$

so that $E(BR) = 4.2 \neq E(B)E(R) = 1.3 \times 3.1 = 4.03$.

We invite you to construct an example of marbles where R and B are not
independent and yet E(BR) = E(B)E(R). This example will convince you that
E(XY) = E(X)E(Y) does not imply that X and Y are independent.

We have seen the following properties of expectation: 


* expectation is linear 

* expectation is monotone 

* the expected value of the product of two independent random variables is the 
product of their expected values. 
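
The three properties can be checked numerically on a small bag. The sketch below assumes ten marbles whose (B, R) marks are reconstructed from the quantities quoted in this section (Fig. A.1 itself is not reproduced here), so the list `marbles` should be read as an assumption.

```python
from fractions import Fraction

# Assumed (B, R) marks on the ten marbles, reconstructed from the text.
marbles = [(1, 1), (1, 1), (1, 3), (1, 3), (1, 4),
           (1, 4), (1, 4), (2, 3), (2, 4), (2, 4)]

def expect(f):
    """E(f(B, R)) when each marble is picked with probability 1/10."""
    return sum(Fraction(f(b, r)) for b, r in marbles) / len(marbles)

E_B = expect(lambda b, r: b)
E_R = expect(lambda b, r: r)
E_BR = expect(lambda b, r: b * r)

# Linearity: holds even though B and R are not independent.
linear = expect(lambda b, r: b + r) == E_B + E_R

# Monotonicity: B <= R on every marble, hence E(B) <= E(R).
monotone = all(b <= r for b, r in marbles) and E_B <= E_R

# Product rule: need not hold when the variables are dependent.
product_rule = E_BR == E_B * E_R

print(linear, monotone, product_rule)
```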

A.6 Variance 


Let us define the variance of a random variable X as 


$$\mathrm{var}(X) = E((X - E(X))^2). \qquad \text{(A.7)}$$

The variance measures the variability around the mean. If the variance is small, the
random variable is likely to be close to its mean.

By linearity of expectation, we have

$$\mathrm{var}(X) = E(X^2 - 2XE(X) + [E(X)]^2) = E(X^2) - 2E(X)E(X) + [E(X)]^2.$$

Hence,

$$\mathrm{var}(X) = E(X^2) - [E(X)]^2, \qquad \text{(A.8)}$$


as we saw already in the previous section. 
Note also that 


$$\mathrm{var}(aX) = E((aX)^2) - [E(aX)]^2 = a^2 E(X^2) - a^2 [E(X)]^2 = a^2 \left( E(X^2) - [E(X)]^2 \right).$$

Hence,

$$\mathrm{var}(aX) = a^2 \mathrm{var}(X). \qquad \text{(A.9)}$$

Now, assume that X and Y are independent random variables. Then we find that

$$\mathrm{var}(X + Y) = E((X + Y)^2) - [E(X + Y)]^2$$
$$= E(X^2 + 2XY + Y^2) - [E(X)]^2 - 2E(X)E(Y) - [E(Y)]^2$$
$$= E(X^2) - [E(X)]^2 + E(Y^2) - [E(Y)]^2 + 2E(XY) - 2E(X)E(Y).$$

Therefore, if X and Y are independent,

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y), \qquad \text{(A.10)}$$


where the last expression results from the fact that E(XY) = E(X)E(Y) when the 
random variables are independent. 
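
The scaling and sum rules can be verified exactly on a small example. The sketch below assumes two independent fair dice (an illustrative choice, not an example from the text) and enumerates the 36 equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice: 36 equally likely outcomes (x, y).
outcomes = list(product(range(1, 7), repeat=2))

def expect(f):
    """E(f(X, Y)) under the uniform distribution on the 36 outcomes."""
    return sum(Fraction(f(x, y)) for x, y in outcomes) / len(outcomes)

def var(f):
    """var(Z) = E(Z^2) - [E(Z)]^2 for Z = f(X, Y)."""
    return expect(lambda x, y: f(x, y) ** 2) - expect(f) ** 2

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
a = 3

scaling_ok = var(lambda x, y: a * x) == a ** 2 * var_x   # var(aX) = a^2 var(X)
sum_ok = var(lambda x, y: x + y) == var_x + var_y        # var(X + Y) = var(X) + var(Y)
print(var_x, scaling_ok, sum_ok)
```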

The square root of the variance is called the standard deviation. 

Summing up, we saw the following results about variance: 


* when one multiplies a random variable by a constant, its variance gets multiplied 
by the square of the constant 

* the variance of the sum of independent random variables is the sum of their 
variances 

* the standard deviation of a random variable is the square root of its variance.


A.7 Inequalities


The fact that expectation is monotone yields some inequalities that are useful to 
bound the probability that a random variable takes large values. Intuitively, if a 
random variable is likely to take large values, its expected value is large. 

The simplest such inequality is as follows. Let X be a random variable that is
always non-negative. Then

$$P(X \ge a) \le \frac{E(X)}{a}, \text{ for } a > 0.$$

This is called Markov's inequality. To prove it, we define the random variable Y as
being 0 when X < a and 1 when X ≥ a. Hence,

$$E(Y) = 0 \times P(X < a) + 1 \times P(X \ge a) = P(X \ge a).$$

We note that Y ≤ X/a. Indeed, that inequality is immediate if X < a because then
Y = 0 and X/a ≥ 0. It is also immediate when X ≥ a because then Y = 1 and
X/a ≥ 1. Consequently, by monotonicity of expectation, E(Y) ≤ E(X/a). Hence,

$$P(X \ge a) = E(Y) \le E\left(\frac{X}{a}\right) = \frac{E(X)}{a},$$

where the last equality comes from the linearity of expectation.
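The second inequality is Chebyshev's inequality. It states that

$$P(|X - E(X)| \ge \epsilon) \le \frac{\mathrm{var}(X)}{\epsilon^2}, \text{ for } \epsilon > 0.$$

To derive this inequality, we define $Z = |X - E(X)|^2$. Markov's inequality says
that

$$P(Z \ge \epsilon^2) \le \frac{E(Z)}{\epsilon^2} = \frac{\mathrm{var}(X)}{\epsilon^2}.$$

Now, $Z \ge \epsilon^2$ is equivalent to $|X - E(X)| \ge \epsilon$. This proves the inequality.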
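
To see how loose or tight these bounds can be, the following sketch checks both inequalities exactly for a fair die. The die and the grid of thresholds are assumptions made for illustration.

```python
from fractions import Fraction

values = range(1, 7)      # faces of a fair die, each with probability 1/6
p = Fraction(1, 6)
mean = sum(x * p for x in values)
variance = sum(x * x * p for x in values) - mean ** 2

def prob(event):
    """Probability that the face x satisfies the predicate `event`."""
    return sum(p for x in values if event(x))

# Markov: P(X >= a) <= E(X)/a for a > 0.
markov_ok = all(prob(lambda x: x >= a) <= mean / a for a in range(1, 7))

# Chebyshev: P(|X - E(X)| >= eps) <= var(X)/eps^2 for eps > 0.
eps_values = [Fraction(k, 2) for k in range(1, 7)]
chebyshev_ok = all(prob(lambda x: abs(x - mean) >= eps) <= variance / eps ** 2
                   for eps in eps_values)

print(mean, variance, markov_ok, chebyshev_ok)
```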


A.8 Law of Large Numbers 


Chebyshev's inequality is particularly useful when we consider a sum of independent
random variables. Assume that $X_1, X_2, \ldots, X_n$ are independent random
variables with the same expected value $\mu$ and the same variance $\sigma^2$. Define
$Y = (X_1 + \cdots + X_n)/n$. Observe that

$$E(Y) = E\left( \frac{X_1 + \cdots + X_n}{n} \right) = \frac{n\mu}{n} = \mu.$$

Also,

$$\mathrm{var}(Y) = \frac{1}{n^2} \mathrm{var}(X_1 + \cdots + X_n) = \frac{1}{n^2} \times n\sigma^2 = \frac{\sigma^2}{n}.$$

Consequently, Chebyshev's inequality implies that

$$P(|Y - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}.$$

This probability becomes arbitrarily small as n increases. Thus, if Y is the
average of n random variables that are independent and have the same mean $\mu$
and the same variance, then Y is very close to $\mu$, with a high probability, when n is
large. This is called the Weak Law of Large Numbers.

Note that this result extends to the case where the random variables are
independent, have the same mean, and have a variance bounded by some $\sigma^2$.
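
A small simulation illustrates the statement. This is only a sketch under stated assumptions: fair coin flips, a fixed seed, and arbitrarily chosen values of n, of the tolerance `eps`, and of the number of trials.

```python
import random

def fraction_of_heads(n, rng):
    """Average Y of n independent fair-coin flips (1 = heads, 0 = tails)."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n

rng = random.Random(0)    # fixed seed, so that the run is repeatable
mu, sigma2, eps = 0.5, 0.25, 0.05
trials = 500

results = []
for n in (100, 1000, 4000):
    misses = sum(abs(fraction_of_heads(n, rng) - mu) >= eps for _ in range(trials))
    bound = min(1.0, sigma2 / (n * eps ** 2))   # Chebyshev bound, capped at 1
    results.append((n, misses / trials, bound))

for n, frequency, bound in results:
    print(n, frequency, bound)   # empirical frequency of a miss vs. the bound
```

The empirical frequencies are typically far below the Chebyshev bound, which is a reminder that the bound is valid but usually not tight.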


A.9 Covariance and Regression 


Consider once again the N marbles with the numbers (X, Y). We define the 
covariance of X and Y as 


cov(X, Y) = E(XY) — E(X)E(Y). 


By linearity of expectation, one can check that cov(X, Y) = E((X - E(X))(Y -
E(Y))). This expression suggests that cov(X, Y) is positive when X and Y tend to
be large or small together. We make this idea more precise in this section. 

Observe that if X and Y are independent, then E(XY) = E(X)E(Y), so 
that cov(X, Y) = 0. Two random variables are said to be uncorrelated if their 
covariance is zero. Thus, independent random variables are uncorrelated. The 
converse is not true. 

Say that you observe X and you want to guess Y, as we did earlier with R and B. 
For instance, you observe the height of a person and you want to guess his weight. 
To do this, you choose two numbers a and b and you estimate Y by $\hat{Y}$ where

$$\hat{Y} = a + E(Y) + b(X - E(X)).$$


The goal is to choose a and b so that $\hat{Y}$ tends to be close to Y. Thus, $\hat{Y}$ can be an
arbitrary linear function of X and we want to find the best linear function. We wrote
the linear function in this particular form to simplify the subsequent algebra.

To make this precise, we look for the values of a and b that minimize
$E((Y - \hat{Y})^2)$. That is, we want the error $Y - \hat{Y}$ to be small, on average. We consider the
square of the error because this is much easier to analyze than choosing the absolute
value of the error.

Now,

$$(Y - \hat{Y})^2 = [Y - E(Y) - a - b(X - E(X))]^2$$
$$= a^2 + (Y - E(Y))^2 + b^2 (X - E(X))^2 - 2a(Y - E(Y)) - 2b(Y - E(Y))(X - E(X)) + 2ab(X - E(X)).$$

Taking the expected value, we get

$$E((Y - \hat{Y})^2) = a^2 + \mathrm{var}(Y) + b^2 \mathrm{var}(X) - 2b \, \mathrm{cov}(X, Y).$$

To do the calculation, we used the linearity of expectation and the facts that the
expected values of X - E(X) and Y - E(Y) are equal to zero. To minimize this
expression over a, we should choose a = 0. To minimize it over b, we set the
derivative with respect to b equal to zero and we find

$$2b \, \mathrm{var}(X) - 2 \, \mathrm{cov}(X, Y) = 0,$$

so that b = cov(X, Y)/var(X). Consequently,

$$\hat{Y} = E(Y) + \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} (X - E(X)).$$

We call $\hat{Y}$ the Linear Least Squares Estimate (LLSE) of Y given X. It is the linear
function of X that minimizes the mean squared error with Y.
As an example, consider again the N marbles. There, 


$$E(X) = \frac{1}{N} \sum_{n=1}^{N} X(n), \qquad E(Y) = \frac{1}{N} \sum_{n=1}^{N} Y(n),$$

$$\mathrm{var}(X) = \frac{1}{N} \sum_{n=1}^{N} X^2(n) - [E(X)]^2,$$

$$\mathrm{var}(Y) = \frac{1}{N} \sum_{n=1}^{N} Y^2(n) - [E(Y)]^2,$$

$$\mathrm{cov}(X, Y) = \frac{1}{N} \sum_{n=1}^{N} X(n)Y(n) - E(X)E(Y).$$

In this case, one calls the resulting expression for $\hat{Y}$ the linear regression of Y
against X. The linear regression is the same as the LLSE when one considers that 
(X, Y) are random variables that are equal to a given sample (X (n), Y (n)) with 
probability 1/N for n = 1,..., N. That is, to compute the linear regression, one 
assumes that the sample values one has observed are representative of the random 
pair (X, Y) and that each sample is an equally likely value for the random pair. 
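
The sample formulas translate directly into code. In the sketch below, the function name `linear_regression` and the height and weight values are hypothetical, chosen only to illustrate the computation.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the linear regression of Y against X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    slope = cov_xy / var_x                  # b = cov(X, Y) / var(X)
    intercept = mean_y - slope * mean_x     # from Yhat = E(Y) + b (X - E(X))
    return slope, intercept

# Hypothetical sample (heights in cm, weights in kg), for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 84]
slope, intercept = linear_regression(heights, weights)
print(slope, intercept)

# Estimated weight of a person who is 175 cm tall, according to this sample.
print(intercept + slope * 175)
```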


A.10 Why Do We Need a More Sophisticated Formalism? 


The previous sections show that one can get quite far in the discussion of probability 
concepts by considering a finite set of marbles, and random variables that can only 
take finitely many values. In engineering, one might think that this is enough. One 
may approximate any sensible quantity with a finite number of bits, so for all 
applications one may consider that there are only finitely many possibilities. All 
that is true, but results in clumsy models. For instance, try to write the equations 
of a falling object with discretized variables. The continuous versions are usually 
simpler than the discrete ones. As another example, a Gaussian random variable is 
easier to work with than a binomial or Poisson random variable, as we will see. 

Thus we need to extend the model to random variables that have an infinite, even 
an uncountable set of possible values. Does this step cause formidable difficulties? 
Not at the intuitive level. The continuous version is a natural extension of the 
discrete case. However, there is a philosophical difficulty in going from discrete 
to continuous. Some thinkers do not accept the idea of making an infinite number 
of choices before moving on. That is, say that we are given an infinite collection
$\{A_1, A_2, \ldots\}$ of nonempty sets. Can we reasonably define a new set B that contains
one element from each $A_n$? We can define it, but if there is no finite way of building
it, can we assume that it exists? A theory that does not rely on this axiom of choice
is considerably more complex than one that does. The classical theory of probability
(due to Kolmogorov) accepts the axiom of choice, and we follow that theory.

One key axiom of probability theory makes it possible to define the probability of a set A
of outcomes as a limit of that of simpler sets $A_n$ that approach A. This is similar
to approximating the area of a circle by the sum of the areas of disjoint rectangles
that approach it from inside, or approximating an integral by a sum of rectangles.
This key axiom says that if $A_1 \subset A_2 \subset A_3 \subset \cdots$ and $A = \cup_n A_n$, then
$P(A) = \lim_n P(A_n)$. Thus, if sets $A_n$ approximate A from inside in the sense that these sets
grow and eventually contain every point of A, then the probability of A is equal to
the limit of the probability of $A_n$. This is a natural way of extending the definition
of probability of simple sets to more complex ones. The trick is to show that this 
is a consistent definition in the sense that different approximating sequences of sets 
must have the same limiting probability. 

This key axiom makes it possible to prove the strong law of large numbers. That law states
that as you keep on flipping coins, the fraction of heads converges to the probability
that one coin yields heads. Thus, not only is the fraction of heads very likely to be
close to that probability when you flip many coins, but in fact the fraction gets closer
and closer to that probability. This property justifies the frequentist interpretation of 
probability of an event as the long-term fraction of time that event occurs when one 
repeats the experiment. This is the interpretation that we used to justify the definition 
of expected value. 


A.11 References 


There are many useful texts and websites on elementary probability. Readers might 
find Walrand (2019) worthwhile, especially since it is free on Kindle. 


A.12 Solved Problems 


Problem A.1 You have a bag with 20 red marbles and 30 blue marbles. You shake 
the bag and pick three marbles, one at a time, without replacement. What is the 
probability that the third marble is red? 


Solution As is often the case, there is a difficult and an easy way to solve this 
problem. The difficult way is to consider the first marble, then find the probability 
that the second marble is red or blue given the color of the first marble, then find 
the probability that the third marble is red given the colors of the first two marbles. 

The easy way is to notice that, by symmetry, the probability that the third marble 
is red is the same as the probability that the first marble is red, which is 20/50 = 0.4. 

It may be useful to make the symmetry argument explicit. Think of the marbles as 
being numbered from 1 to 50. Imagine that shaking the bag results in some ordering 
in which the marbles would be picked one by one out of the bag. All the orderings 
are equally likely. Now think of interchanging marble one and marble three in each 
ordering. You end up with a new set of orderings that are again equally likely. In 
this new ordering, the third marble is the first one to get out of the bag. Thus, the 
probability that the third marble is red is the same as the probability that the first 
marble is red. 
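
The symmetry argument can be cross-checked against the difficult way. The sketch below conditions on the colors of the first two marbles and uses exact fractions; the function name `prob_third_red` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_third_red(red, blue):
    """P(third marble is red), conditioning on the colors of the first two picks."""
    total = Fraction(0)
    for first_two in product("RB", repeat=2):
        r, b, p = red, blue, Fraction(1)
        for color in first_two:
            if color == "R":
                p *= Fraction(r, r + b)   # pick one of the remaining red marbles
                r -= 1
            else:
                p *= Fraction(b, r + b)   # pick one of the remaining blue marbles
                b -= 1
        total += p * Fraction(r, r + b)   # the third pick is red
    return total

print(prob_third_red(20, 30), Fraction(20, 50))
```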


Problem A.2 Your applied probability class has 275 students who all turn in their 
homework assignment. The professor returns the graded assignments in a random 
order to the students. What is the expected number of students who get their own 
assignment back? 


Solution The difficult way to solve the problem is to consider the first assignment, 
then the second, and so on, and for each to explore what happens if it is returned to 
its owner or not. The probability that one student gets her assignment back depends 
on what happened to the other students. It all seems very complicated. 

The easy way is to argue that, by symmetry, the probability that any given student 
gets his assignment back is the probability that the first one gets his assignment
back, which is 1/275. Let then $X_n = 1$ if student n gets his/her own assignment and
$X_n = 0$ otherwise. Thus, $E(X_n) = 1/275$. The number of students who get their
assignment back is $X_1 + \cdots + X_{275}$. Now, by linearity of expectation,
$E(X_1 + \cdots + X_{275}) = E(X_1) + \cdots + E(X_{275}) = 275 \times (1/275) = 1$.
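
Enumerating all orderings is infeasible for 275 students, but the claim can be checked exactly for small classes. The sketch below, with the hypothetical helper `expected_matches`, averages the number of matches over all permutations of n assignments.

```python
from fractions import Fraction
from itertools import permutations

def expected_matches(n):
    """Exact expected number of students, out of n, who get their own assignment back."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, p in enumerate(perm) if i == p) for perm in perms)
    return Fraction(total, len(perms))

# The expectation is the same for every class size tried here.
print([int(expected_matches(n)) for n in range(1, 7)])
```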


Problem A.3 A monkey types a sequence of one million random letters on a 
typewriter. How many times does the name “walrand” appear in that sequence? 
Assume that the typewriter has 40 keys: the 26 letters and 14 other characters. 


Solution The easy solution uses the linearity of expectation. Let $X_n = 1$ if the name
"walrand" appears in the sequence, starting at the n-th symbol of the string, and $X_n = 0$
otherwise. The number of times that the name appears is then $Z = X_1 + \cdots + X_N$ with
$N = 10^6 - 6$. By symmetry, $E(X_n) = E(X_1)$ for all n. Now, the probability that $X_1 = 1$ is equal
to the probability that the first symbol is w, that the second symbol is a, and so on.
Thus, $E(X_1) = P(X_1 = 1) = (1/40)^7$. Hence, the expected number of times that
"walrand" appears is $E(Z) = (10^6 - 6) \times (1/40)^7 \approx 6 \times 10^{-6}$. So, it is true that
a monkey could eventually type one of Shakespeare's plays, but he is likely to die
before succeeding.

Note that Markov's inequality implies that

$$P(Z \ge 1) \le E(Z) \approx 6 \times 10^{-6}.$$
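
The arithmetic is easy to reproduce. The variable names in this sketch are illustrative.

```python
keys = 40                    # equally likely keys on the typewriter
length = 10 ** 6             # number of symbols typed
word = "walrand"

positions = length - len(word) + 1        # possible starting positions: 10^6 - 6
p_match = (1 / keys) ** len(word)         # P(X_1 = 1) = (1/40)^7
expected_count = positions * p_match      # E(Z), by linearity of expectation

# By Markov's inequality, this number also bounds P(Z >= 1) from above.
print(positions, expected_count)
```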


Problem A.4 You flip a fair coin n times and the fraction of heads is Y. How large
does n have to be to be sure that P(|Y - 0.5| ≥ 0.05) ≤ 0.05?


Solution We use Chebyshev's inequality that states

$$P(|Y - 0.5| \ge \epsilon) \le \frac{\mathrm{var}(Y)}{\epsilon^2}.$$

We saw in our discussion of the weak law of large numbers that $\mathrm{var}(Y) =
\mathrm{var}(X_1)/n$ where $X_1 = 1$ if the first coin yields heads and $X_1 = 0$ otherwise. Since
$P(X_1 = 1) = P(X_1 = 0) = 0.5$, we find that $E(X_1) = 0.5$ and $E(X_1^2) = 0.5$.
Hence, $\mathrm{var}(X_1) = E(X_1^2) - [E(X_1)]^2 = 0.25$. Consequently,

$$P(|Y - 0.5| \ge 0.05) \le \frac{0.25}{n \times 25 \times 10^{-4}} = \frac{100}{n}.$$

Thus, the right-hand side is 0.05 if n = 2,000. You have to be patient.


Problem A.5 What is the probability that two friends share a birthday? What about
three friends? What about n? How large does n have to be for this probability to be
50%?


Solution In the case of two friends, it is the probability that the second has the same
birthday as the first, which is 1/365 (ignoring February 29).

The case of three friends looks more complicated: two of the three or all of them 
could share a birthday. It is simpler to look at the probability that they do not share 
a birthday. This is the probability that the second friend does not have the same 
birthday as the first, which is 364/365 times the probability that the third does not 
share a birthday with the first two, which is 363/365. Let us explore this a bit further 
to make sure we fully understand this solution. First, we consider all the strings of
three numbers picked in {1, 2, ..., 365}. There are $365^3$ such strings because there
are 365 choices for the first number, then 365 for the second, and finally 365 for 
the third. Second, consider the strings of three different numbers from the same set. 
There are 365 choices for the first, then 364 for the second, then 363 for the third. 
Hence, there are 365 x 364 x 363 such strings. Since all the strings are equally 
likely to be picked (a reasonable assumption), the probability that the friends do not 
share a birthday is 


$$\frac{365 \times 364 \times 363}{365 \times 365 \times 365}.$$

The case of n friends is then clear: they do not share a birthday with probability
p where

$$p = \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365 \times 365 \times \cdots \times 365}
= 1 \times \left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{n - 1}{365}\right).$$

To evaluate this expression, we use the fact that $1 - x \approx \exp\{-x\}$ when $|x| \ll 1$.
We use this fact repeatedly in this book. Do not worry, there are not too many
such tricks. In practice, this approximation is good for |x| < 0.1. For instance,
$\exp\{-0.1\} \approx 0.90483$. Thus, assuming that n/365 ≤ 0.1, i.e., n ≤ 36, we find

$$p \approx 1 \times \exp\left\{-\frac{1}{365}\right\} \times \exp\left\{-\frac{2}{365}\right\} \times \cdots \times \exp\left\{-\frac{n - 1}{365}\right\}$$
$$= \exp\left\{-\frac{1}{365} - \frac{2}{365} - \cdots - \frac{n - 1}{365}\right\} = \exp\left\{-\frac{1 + 2 + \cdots + (n - 1)}{365}\right\}$$
$$= \exp\left\{-\frac{(n - 1)n}{730}\right\}.$$

For instance, with n = 24, we find

$$p \approx \exp\left\{-\frac{23 \times 24}{730}\right\} \approx 0.47.$$

Hence, the probability that at least two friends in a group of 24 share a birthday
is about 50%. This result is somewhat surprising because 24 is small compared to 
365. One calls this observation the birthday paradox. Many people think that it takes 
about 365/2 ~ 180 friends for the probability that they share a birthday to be 50%. 
The paradox is less mysterious when you think of the many ways that friends can 
share birthdays. 
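
The exact product and the exponential approximation are easy to compare numerically. The following is a sketch; the function names are hypothetical and the group sizes printed are arbitrary.

```python
import math

def p_no_shared_exact(n, days=365):
    """Exact probability that n friends all have different birthdays."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days
    return p

def p_no_shared_approx(n, days=365):
    """Approximation exp{-(n - 1) n / (2 days)}, obtained from 1 - x ~ exp{-x}."""
    return math.exp(-(n - 1) * n / (2 * days))

for n in (2, 3, 10, 24):
    print(n, 1 - p_no_shared_exact(n), 1 - p_no_shared_approx(n))

# Smallest group for which a shared birthday is at least as likely as not.
smallest = next(n for n in range(1, 366) if 1 - p_no_shared_exact(n) >= 0.5)
print(smallest)
```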


Problem A.6 You throw M marbles into B bins, each time independently and in 
a way that each marble is equally likely to fall into each bin. What is the expected 
number of empty bins? What is the probability that no bin contains more than one 
marble? 


Solution The first bin is empty with probability $\alpha := [(B - 1)/B]^M$, and the same is
true for every bin. Hence, if $X_b = 1$ when bin b is empty and $X_b = 0$ otherwise, we
see that $E(X_b) = \alpha$. Hence, the expected value of the number $Z = X_1 + \cdots + X_B$
of empty bins is equal to

$$E(Z) = BE(X_1) = B\alpha.$$

To evaluate this expression we use the following approximation:

$$\left(1 - \frac{a}{N}\right)^N \approx \exp\{-a\} \text{ for } N \gg 1.$$

This approximation is already quite good for N = 10 and 0 < a ≤ 1. For instance,
$(1 - 1/10)^{10} \approx 0.35$ and $\exp\{-1\} \approx 0.37$. Hence, if $M = \beta B$, one can write

$$\alpha = (1 - 1/B)^M = (1 - \beta/M)^M \approx \exp\{-\beta\},$$

so that

$$E(Z) \approx B \exp\{-\beta\}.$$

For instance, with M = 20 and B = 30, one has $\beta = 2/3$ and $E(Z) \approx
30 \exp\{-2/3\} \approx 15$. That is, the 20 marbles are likely to fall into 15 of the 30
bins.
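
The exact expression and its exponential approximation can be compared directly. The function names in this sketch are hypothetical.

```python
import math

def expected_empty_exact(M, B):
    """E(Z) = B * ((B - 1) / B)^M for M marbles thrown into B bins."""
    return B * ((B - 1) / B) ** M

def expected_empty_approx(M, B):
    """E(Z) ~ B * exp{-M / B}."""
    return B * math.exp(-M / B)

print(expected_empty_exact(20, 30), expected_empty_approx(20, 30))
```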

The probability that no bin contains more than one marble is the same as the
probability that no two friends share a birthday when there are B different days and
M friends. We saw in the previous problem that this is approximately given by

$$\exp\left\{-\frac{M(M - 1)}{2B}\right\} \approx \exp\left\{-\frac{M^2}{2B}\right\}.$$


Problem A.7 As an error detection scheme, you compute a checksum of b = 32
bits from the bits of each of M files that you store in a computer and you attach
the checksum to the file. When you read the file, you recompute the checksum and
you compare with the one attached to the file. If the checksums agree, you assume
that no storage/retrieval error occurred. How large can M be before the probability
that two files share a checksum exceeds $10^{-6}$? A similar scheme is used as a digital
signature to make sure that files are not modified.


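Solution There are $B = 2^b$ possible checksums. Let us assume that each file is
equally likely to get any one of the B checksums. In view of the previous problem,
the probability that no two files share a checksum is approximately $\exp\{-M^2/(2B)\}$,
so we want to find M such that

$$1 - \exp\left\{-\frac{M^2}{2B}\right\} = 10^{-6}.$$

Since $1 - \exp\{-x\} \approx x$ when $|x| \ll 1$, this gives $M^2/(2B) \approx 10^{-6}$, so that
$M^2 \approx 2 \times 10^{-6} \times 2^b$ and $M \approx 2^{b/2} \sqrt{2} \times 10^{-3} \approx 1.4 \times 10^{-3} \times 2^{b/2}$.
With b = 32, we find $M \approx 93$.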


Problem A.8 N people apply for a job with your company. You will interview 
them sequentially but you must either hire or decline a person right at the end of 
the interview. How should you proceed to maximize the chance of picking the best 
of all the candidates? Implicitly, we assume that the qualities of the candidates are
all independent and equally likely to be any number in {1, ..., Q} where Q is very
large.


Solution The best strategy is to interview and decline about M = N/e candidates
and then hire the first subsequent candidate who is better than those M. Here,
$e = \exp\{1\} \approx 2.72$. If no candidate among {M + 1, ..., N} is better than the first M,
you hire the last candidate.

To justify this procedure, we compute the probability that the candidate you select
is the best, for a given value of M. By symmetry, the best candidate appears in
position b with probability 1/N. You then pick the best candidate if b > M and if
the best candidate among the first b - 1 is among the first M, which has probability
M/(b - 1), by symmetry. Since probability is additive, the probability p that you
pick the best candidate is given by

$$p = \sum_{b=M+1}^{N} \frac{1}{N} \frac{M}{b - 1} = \frac{M}{N} \sum_{b=M}^{N-1} \frac{1}{b} \approx \frac{M}{N} \int_{M}^{N} \frac{1}{b} \, db = \frac{M}{N} [\log(N) - \log(M)].$$

To find the maximizing value of M, we set the derivative of this expression with
respect to M equal to zero. This gives $\log(N/M) = 1$, i.e., N/M = e.
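
The sum (before the integral approximation) can be maximized numerically to see how close the best M is to N/e. In this sketch, the function name `success_probability` and the choice N = 100 are assumptions for illustration.

```python
import math

def success_probability(M, N):
    """p = (M/N) * sum_{b=M+1}^{N} 1/(b-1): chance of hiring the best of N
    candidates after declining the first M (requires 1 <= M < N)."""
    return (M / N) * sum(1 / (b - 1) for b in range(M + 1, N + 1))

N = 100
best_M = max(range(1, N), key=lambda M: success_probability(M, N))
best_p = success_probability(best_M, N)
print(best_M, best_p, N / math.e)   # the best M is close to N/e
```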


Topics: General framework, conditional probability, independence, expectation,
pdf, cdf, function of random variables, correlation, variance, transformation of jpdf.


B.1 General Framework 


The general model of Probability Theory may seem a bit abstract and disconcerting. 
However, it unifies all the key ideas in a systematic framework and results in great
conceptual clarity. You should try to keep this underlying framework in mind when
we discuss concrete examples. 


B.1.1 Probability Space 


To describe a random experiment, one first specifies the set $\Omega$ of all the possible
outcomes. This set is called the sample space. For instance, when we flip a coin, the
sample space is $\Omega = \{H, T\}$; when we roll a die, $\Omega = \{1, 2, 3, 4, 5, 6\}$; when one
measures a voltage one may have $\Omega = \Re = (-\infty, +\infty)$; and so on.

Second, one specifies the probability that the outcome falls in subsets of $\Omega$. That
is, for $A \subset \Omega$, one specifies a number $P(A) \in [0, 1]$ that represents the likelihood
that the random experiment yields an outcome in A. For instance, when rolling a
die, the probability that the outcome is in a set $A \subset \{1, 2, 3, 4, 5, 6\}$ is given by
P(A) = |A|/6 where |A| is the number of elements of A. When we measure a
voltage, the probability that it has any given value is typically 0, but the probability
that it is less than 15 in absolute value may be 95%, which is why we specify the
probability of subsets, not of specific outcomes.

Of course, the specification of the probability of subsets of $\Omega$ cannot be arbitrary.
For instance, if $A \subset B$, then one must have P(A) ≤ P(B). Also, $P(\Omega) = 1$.
Moreover, if A and B are disjoint, i.e., if $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$.
Finally, to be able to approximate a complex set by simple sets, one requires that if
$A_1 \subset A_2 \subset A_3 \subset \cdots$ and if $A = \cup_n A_n$, then $P(A_n) \to P(A)$. Equivalently, if
$A_1 \supset A_2 \supset A_3 \supset \cdots$ and if $A = \cap_n A_n$, then $P(A_n) \to P(A)$.

This property also implies the following result. 


B.1.2 Borel-Cantelli Theorem 


Theorem B.1 (Borel-Cantelli Theorem) Let $A_n$ be events such that

$$\sum_{n=1}^{\infty} P(A_n) < \infty.$$

Then

$$P(A_n, \text{ i.o.}) = 0.$$

Here, $\{A_n, \text{ i.o.}\}$ is defined as the set of outcomes $\omega$ that are in infinitely many
sets $A_n$. So, stating that the probability of this set is equal to zero means that
the probability that the events $A_n$ occur for infinitely many n's, i.e., infinitely often,
is equal to zero. In other words, with probability one, there is some m such that
$A_n$ does not occur for any n larger than m.


Proof First note that

$$\{A_n, \text{ i.o.}\} = \cap_n B_n =: B,$$

where $B_n = \cup_{m \ge n} A_m$ is a decreasing sequence of sets. To see this, note that the
outcome $\omega$ is in infinitely many sets $A_n$, i.e., that $\omega \in \{A_n, \text{ i.o.}\}$, if and only if for
every n, the outcome $\omega$ is in some $A_m$ for m ≥ n. Also, $\omega$ is in $\cup_{m \ge n} A_m = B_n$ for
all n if and only if $\omega$ is in $\cap_n B_n = B$. Hence $\omega \in \{A_n, \text{ i.o.}\}$ if and only if $\omega \in B$.

Now, $B_1 \supset B_2 \supset \cdots$, so that $P(B_n) \to P(B)$. Thus,

$$P(A_n, \text{ i.o.}) = P(B) = \lim_{n \to \infty} P(B_n)$$

and $P(B_n) \le \sum_{m \ge n} P(A_m)$, so that¹

$$P(B_n) \to 0 \text{ as } n \to \infty.$$

Consequently, $P(A_n, \text{ i.o.}) = 0$. □


You may wonder whether $\sum_n P(A_n) = \infty$ implies that $P(A_n, \text{ i.o.}) = 1$. As
a simple counterexample, imagine that you have an infinite collection of coins that
you solder together in an infinite line, all heads up. Assume also that this long string
is balanced and that you manage to flip it. Let $A_n$ be the event that coin n yields
heads. In this contraption, either all the coins yield heads, with probability 0.5, or all
the coins yield tails. Also, $P(A_n) = 0.5$, so that $\sum_n P(A_n) = \infty$ and $P(A_n, \text{ i.o.}) =
0.5$. However, we show in the next section that the result holds if the events are
mutually independent.

For the sake of completeness, we should mention that it is generally not possible
to specify the probability of all the subsets of $\Omega$. This does not really matter
in applications. The terminology is that the subsets of $\Omega$ with a well-defined
probability are events.


B.1.3 Independence 


We say that the events A and B are independent if $P(A \cap B) = P(A)P(B)$.

For instance, roll two dice. An outcome is a pair $(a, b) \in \{1, 2, \ldots, 6\}^2$ where a
corresponds to the first die and b to the second. The event "the first die yields a
number in {2, 4, 5}" corresponds to the set of outcomes A = {2, 4, 5} x {1, ..., 6}.
The event "the second die yields a number in {2, 4}" is the set of outcomes
B = {1, ..., 6} x {2, 4}. We can see that A and B are independent since P(A) =
18/36, P(B) = 12/36 and $P(A \cap B) = 6/36$.

A more subtle notion is that of mutual independence. We say that the events
$\{A_j, j \in J\}$ are mutually independent if

$$P(\cap_{j \in K} A_j) = \prod_{j \in K} P(A_j), \quad \forall \text{ finite } K \subset J.$$

It is easy to construct events that are pairwise independent but not mutually
independent. For instance, let $\Omega = \{1, 2, 3, 4\}$ where the four outcomes are equally
likely and let A = {1, 2}, B = {1, 3}, and C = {1, 4}. You can check that these
events are pairwise independent but not mutually independent since
$P(A \cap B \cap C) = 1/4 \neq P(A)P(B)P(C) = 1/8$.
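
Both examples can be checked by enumerating the equally likely outcomes. In the sketch below, the helper `prob` and the names `A_dice` and `B_dice` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob(event, omega):
    """P(event) when all the outcomes of omega are equally likely."""
    return Fraction(len(event & omega), len(omega))

# Two dice: first die in {2, 4, 5}, second die in {2, 4}.
dice = set(product(range(1, 7), repeat=2))
A_dice = {w for w in dice if w[0] in {2, 4, 5}}
B_dice = {w for w in dice if w[1] in {2, 4}}
dice_independent = prob(A_dice & B_dice, dice) == prob(A_dice, dice) * prob(B_dice, dice)

# Four equally likely outcomes: pairwise versus mutual independence.
omega = {1, 2, 3, 4}
A, B, C = {1, 2}, {1, 3}, {1, 4}
pairwise = all(prob(U & V, omega) == prob(U, omega) * prob(V, omega)
               for U, V in [(A, B), (A, C), (B, C)])
mutual = prob(A & B & C, omega) == prob(A, omega) * prob(B, omega) * prob(C, omega)

print(dice_independent, pairwise, mutual)
```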


¹ Recall that if the nonnegative numbers $a_n$ are such that $\sum_{n} a_n < \infty$, then
$\sum_{m \ge n} a_m$ goes to zero as $n \to \infty$.


B.1.4 Converse of Borel-Cantelli Theorem


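Theorem B.2 (Converse of Borel-Cantelli Theorem) Let $\{A_n, n \ge 1\}$ be a
collection of mutually independent events with $\sum_n P(A_n) = \infty$. Then
$P(A_n, \text{ i.o.}) = 1$.

Proof Recall that

$$\{A_n, \text{ i.o.}\} = \cap_n B_n \text{ where } B_n = \cup_{m \ge n} A_m.$$

Hence,

$$\{A_n, \text{ i.o.}\}^c = \cup_n B_n^c \text{ where } B_n^c = \cap_{m \ge n} A_m^c.$$

Thus, to prove the theorem, it suffices to show that $P(B_n^c) = 0$ for all n. Indeed, if
that is the case, then $P(\cup_{n=1}^{N} B_n^c) \le \sum_{n=1}^{N} P(B_n^c) = 0$ and the sets $\cup_{n=1}^{N} B_n^c$ are increasing
with N and their union is $\cup_n B_n^c$, so that $P(\cup_n B_n^c) = \lim_{N \to \infty} P(\cup_{n=1}^{N} B_n^c) = 0$.
Now,

$$P(B_n^c) = P(\cap_{m \ge n} A_m^c) = \lim_{N \to \infty} P(\cap_{m=n}^{N} A_m^c)$$
$$= \lim_{N \to \infty} \prod_{m=n}^{N} P(A_m^c) = \lim_{N \to \infty} \prod_{m=n}^{N} [1 - P(A_m)]$$
$$\le \lim_{N \to \infty} \prod_{m=n}^{N} \exp\{-P(A_m)\} = \lim_{N \to \infty} \exp\left\{-\sum_{m=n}^{N} P(A_m)\right\} = 0.$$

In this derivation we used the facts that $1 - x \le \exp\{-x\}$ and $\sum_{m=n}^{N} P(A_m) \to \infty$
as $N \to \infty$. □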


B.1.5 Conditional Probability 


Let $A$ and $B$ be two events. Assume that $P(B) > 0$. One defines the conditional 
probability $P[A|B]$ of $A$ given $B$ as follows: 

$$P[A|B] = \frac{P(A \cap B)}{P(B)}.$$


The meaning of $P[A|B]$ is the probability that the outcome of the experiment is 
in $A$ given that it is in $B$. As an example, say that a random experiment has 1000 
equally likely outcomes. Assume that $A$ contains $|A|$ outcomes and $B$ contains $|B|$ 
outcomes. If we know that the outcome is in $B$, we know that it is equally likely 
to be any one of these $|B|$ outcomes. Given that information, the probability that 
the outcome is in $A$ is then the fraction of outcomes in $B$ that are also in $A$. This 
fraction is 

$$\frac{|A \cap B|}{|B|} = \frac{|A \cap B|/1000}{|B|/1000} = \frac{P(A \cap B)}{P(B)}.$$


Note that the definition implies that if A and B are independent, then P[A|B] = 
P(A), which makes intuitive sense. Also, 


$$P(A \cap B) = P[A|B]P(B).$$


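This expression extends to more than two events. For instance, with events 
$\{A_1, \ldots, A_n\}$ one has 

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) P[A_2 | A_1] P[A_3 | A_1 \cap A_2] \cdots P[A_n | A_1 \cap \cdots \cap A_{n-1}].$$

To verify this identity, note that the right-hand side is equal to 

$$P(A_1) \times \frac{P(A_1 \cap A_2)}{P(A_1)} \times \frac{P(A_1 \cap A_2 \cap A_3)}{P(A_1 \cap A_2)} \times \cdots \times \frac{P(A_1 \cap \cdots \cap A_n)}{P(A_1 \cap \cdots \cap A_{n-1})}$$

and this product is equal to the left-hand side of the identity above. 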


B.1.6 Random Variable 


A random variable $X$ is a function $X : \Omega \to \Re$. Thus, one associates a real number 
$X(\omega)$ to every possible outcome $\omega$ of the random experiment. 

For instance, when one flips a coin, with $\Omega = \{H, T\}$, one can define a random 
variable $X$ by $X(H) = 1$ and $X(T) = 0$. 

One then uses the notation $P(X \in B) = P(X^{-1}(B))$ for $B \subset \Re$ where 

$$X^{-1}(B) := \{\omega \in \Omega \mid X(\omega) \in B\}.$$

The interpretation is the natural one: the probability that $X \in B$ is the probability 
that the outcome $\omega$ is such that $X(\omega) \in B$. 

In particular, one defines the cumulative distribution function (cdf) of the random 
variable $X$ as $F_X(x) = P(X \in (-\infty, x]) =: P(X \le x)$. This function is 
nondecreasing and right-continuous; it tends to zero as $x \to -\infty$ and to one as 
$x \to +\infty$. 

Figure B.1 summarizes this general framework for one random variable. 

Fig. B.1 The random experiment is described by a set $\Omega$ of outcomes: the 
sample space. Subsets of $\Omega$ called events have a probability. A random 
variable is a real-valued function of the outcome $\omega$ of the random experiment. 
Labels in the figure: $\Re$, the real line; $P(X \in B) := P(X^{-1}(B))$. 


B.2 Discrete Random Variable 
B.2.1 Definition 


A discrete random variable $X$ is defined by a list of distinct possible values and 
their probability: 

$$X = \{(x_n, p_n), n = 1, 2, \ldots, N\}. \quad \text{(B.1)}$$

Here, the $x_n$ are real numbers and the $p_n$ are positive and add up to one. By 
definition, $p_n$ is the probability that $X$ takes the value $x_n$ and we write 

$$p_n = P(X = x_n), \quad n = 1, \ldots, N.$$


The number of values N can be infinite. This list is called the probability mass 
function (pmf) of the random variable X. 
As an example, 


V = {(1, 0.1), (2, 0.3), (3, 0.6)} 


is a random variable that has three possible values $\{1, 2, 3\}$ and takes these values 
with probability 0.1, 0.3, and 0.6, respectively. Equivalently, one can write 

$$V = \begin{cases} 1, & \text{with probability } 0.1; \\ 2, & \text{with probability } 0.3; \\ 3, & \text{with probability } 0.6. \end{cases}$$


Note that the probabilities add up to one. 

The connection with the general framework is the following. There is some 
probability space and some function $X : \Omega \to \Re$ that happens to take the values 
$\{x_1, \ldots, x_N\}$ and is such that 

$$P(X = x_n) = P(\{\omega \in \Omega \mid X(\omega) = x_n\}) = p_n.$$

A possible construction is to define $\Omega = \{1, 2, 3\}$ with $P(\{1\}) = 0.1$, $P(\{2\}) = 
0.3$, $P(\{3\}) = 0.6$, and $X(\omega) = \omega$. This construction is called the canonical 
probability space. It may not be the natural choice. For instance, say that you pick 
a marble out of a bag that has 10 identical marbles except that one is marked 
with the number 1, three with the number 2, and six with the number 3. Let then 
$X$ be the number on the marble that you pick. A more natural probability space 
has ten outcomes (the ten marbles) and $X(\omega)$ is the number on marble $\omega$ for 
$\omega \in \Omega = \{1, 2, \ldots, 10\}$. 

When one is interested in only one random variable X, one cares about its 
possible values and their probability, i.e., its pmf. The details of the random 
experiment do not matter. Thus, one may forget about the bag of marbles. However, 
if the marbles are marked with a second number Y, then one may have to go back 
to the description of the bag of marbles to analyze Y or to analyze the pair (X, Y). 


B.2.2 Expectation 


The expected value, or mean, of the random variable $X$ is denoted $E(X)$ and is 
defined as (Fig. B.2) 

$$E(X) = \sum_{n=1}^N x_n p_n.$$

In our example, 

$$E(V) = 1 \times 0.1 + 2 \times 0.3 + 3 \times 0.6 = 2.5.$$

As another frequently used example, say that $X(\omega) = 1\{\omega \in A\}$ where $A$ is an 
event in $\Omega$. We say that $X$ is the indicator of the event $A$. In this case, $X$ is equal to 
one with probability $P(A)$ and to zero otherwise, so that $E(X) = P(A)$. 


Fig. B.2 The expected value of a random variable 

When $N$ is infinite, the definition makes sense unless the sum of the positive 
terms and that of the negative terms are both infinite. In such a case, one says that 
$X$ does not have an expected value. 

It is a simple exercise to verify that the number $a$ that minimizes $E((X - a)^2)$ is 
$a = E(X)$. Thus, the mean is the "least squares estimate" of $X$. 


B.2.3 Function of a RV 


Consider a function $h : \Re \to \Re$ and a discrete random variable $X$ (Fig. B.3). Then 
$h(X)$ defines a new random variable with values and probabilities 

$$\{(h(x_n), p_n), n = 1, \ldots, N\}.$$

Note that the values $h(x_n)$ may not be distinct, so that to conform to our definition 
of the pmf one should merge identical values and add their probabilities. 
For instance, say that $h(1) = h(2) = 10$ and $h(3) = 15$. Then 

$$h(V) = \{(10, 0.4), (15, 0.6)\},$$

where we merged the two values $h(1)$ and $h(2)$ because they are equal to 10. 
Thus, 

$$E(h(V)) = 10 \times 0.4 + 15 \times 0.6 = 13.$$


Observe that 

$$E(h(V)) = \sum_{n=1}^{3} h(n) p_n,$$

since 

$$\begin{aligned}
\sum_{n=1}^{3} h(n) p_n &= h(1) 0.1 + h(2) 0.3 + h(3) 0.6 \\
&= 10 \times 0.1 + 10 \times 0.3 + 15 \times 0.6 \\
&= 10 \times (0.1 + 0.3) + 15 \times 0.6,
\end{aligned}$$

which agrees with the previous expression. 

Fig. B.3 Function of a random variable 
Let us state that observation as a theorem. 


Theorem B.3 (Expectation of a Function of a Random Variable) Let $X$ be a 
random variable with p.m.f. $\{(x_n, p_n), n = 1, \ldots, N\}$ and $h : \Re \to \Re$ some 
function. Then 

$$E(h(X)) = \sum_{n=1}^N h(x_n) p_n.$$
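The two ways of computing $E(h(V))$ can be compared numerically. The sketch below is illustrative only: the list-of-pairs representation of a pmf and the helper `pmf_of_function` are assumptions made for this example. It merges equal values of $h$ to build the pmf of $h(V)$ and checks it against the sum in Theorem B.3.

```python
from collections import defaultdict
from fractions import Fraction

# pmf of V as (value, probability) pairs, with exact arithmetic
V = [(1, Fraction(1, 10)), (2, Fraction(3, 10)), (3, Fraction(6, 10))]

def h(x):
    """The function of the example: h(1) = h(2) = 10 and h(3) = 15."""
    return 10 if x in (1, 2) else 15

def pmf_of_function(pmf, func):
    """pmf of func(X): merge identical values and add their probabilities."""
    merged = defaultdict(Fraction)
    for x, p in pmf:
        merged[func(x)] += p
    return sorted(merged.items())

h_V = pmf_of_function(V, h)
mean_merged = sum(y * p for y, p in h_V)     # uses the pmf of h(V)
mean_direct = sum(h(x) * p for x, p in V)    # uses Theorem B.3
print(h_V, mean_merged, mean_direct)
```

Both computations give 13, so in practice there is no need to construct the pmf of $h(V)$ first.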


B.2.4 Nonnegative RV 


We say that $X$ is nonnegative, and we write $X \ge 0$, if all its possible values $x_n$ are 
nonnegative. Observe that 

$$\text{if } X \ge 0 \text{ and } E(X) = 0, \text{ then } P(X = 0) = 1.$$

Also, 

$$\text{if } X \ge 0 \text{ and } E(X) < \infty, \text{ then } P(X < \infty) = 1.$$


B.2.5 Linearity of Expectation 


Consider two functions $h_1 : \Re \to \Re$ and $h_2 : \Re \to \Re$ and define $h_1(X) + h_2(X)$ as 
follows: 

$$h_1(X) + h_2(X) = \{(h_1(x_n) + h_2(x_n), p_n), n = 1, \ldots, N\}.$$

As before, 

$$E(h_1(X) + h_2(X)) = \sum_{n=1}^N (h_1(x_n) + h_2(x_n)) p_n.$$

By regrouping terms, we see that 

$$E(h_1(X) + h_2(X)) = E(h_1(X)) + E(h_2(X)).$$


We say that expectation is linear. 


B.2.6 Monotonicity of Expectation 
By $X \ge 0$ we mean that all the possible values of $X$ are nonnegative, i.e., that 
$X(\omega) \ge 0$ for all $\omega$. In that case, $E(X) \ge 0$ since $E(X) = \sum_n x_n P(X = x_n)$ and 
all the $x_n$ are nonnegative. 
We also write $X \le Y$ if $X(\omega) \le Y(\omega)$ for all $\omega$. The linearity of expectation then implies 
that $E(X) \le E(Y)$ since $0 \le E(Y - X) = E(Y) - E(X)$. Hence, 

$$X \le Y \text{ implies that } E(X) \le E(Y). \quad \text{(B.2)}$$


One says that expectation is monotone. 


B.2.7 Variance, Standard Deviation 
The variance $\mathrm{var}(X)$ of a random variable $X$ is defined as (Fig. B.4) 

$$\mathrm{var}(X) = E((X - E(X))^2).$$

By linearity, one has 

$$\begin{aligned}
\mathrm{var}(X) &= E(X^2 - 2XE(X) + E(X)^2) \\
&= E(X^2) - 2E(X)E(X) + E(X)^2 = E(X^2) - E(X)^2.
\end{aligned}$$

With (B.1), one finds 

$$\mathrm{var}(V) = E(V^2) - E(V)^2 = 1^2 \times 0.1 + 2^2 \times 0.3 + 3^2 \times 0.6 - (2.5)^2 = 0.45.$$


Fig. B.4 The variance makes randomness interesting 

The standard deviation $\sigma_X$ of a random variable $X$ is defined as the square root 
of its variance. That is, 

$$\sigma_X := \sqrt{\mathrm{var}(X)}.$$

Note that a random variable $W$ that is equal to $E(X) - \sigma_X$ or to $E(X) + \sigma_X$ with 
equal probabilities is such that $E(W) = E(X)$ and $\mathrm{var}(W) = \mathrm{var}(X)$. In that sense, 
$\sigma_X$ is an "equivalent" deviation from the mean. 

Observe that for any $a \in \Re$ and any random variable $X$ one has 

$$\mathrm{var}(aX) = a^2 \mathrm{var}(X). \quad \text{(B.3)}$$

Indeed, 

$$\begin{aligned}
\mathrm{var}(aX) &= E((aX)^2) - [E(aX)]^2 = E(a^2 X^2) - [aE(X)]^2 \\
&= a^2 E(X^2) - a^2 [E(X)]^2 = a^2 \mathrm{var}(X).
\end{aligned}$$
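These identities are easy to check on the example $V$. The following sketch is an illustration, not part of the original text; the helpers `mean` and `var` and the scaling factor 3 are arbitrary choices. It computes the variance of $V$, builds the two-point random variable $W$ described above, and verifies (B.3).

```python
import math
from fractions import Fraction

V = [(1, Fraction(1, 10)), (2, Fraction(3, 10)), (3, Fraction(6, 10))]

def mean(pmf):
    """E(X) for a pmf given as (value, probability) pairs."""
    return sum(x * p for x, p in pmf)

def var(pmf):
    """var(X) = E((X - E(X))^2)."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf)

mean_V = mean(V)
var_V = var(V)
sigma_V = math.sqrt(var_V)

# W equals E(V) - sigma or E(V) + sigma with equal probabilities
W = [(float(mean_V) - sigma_V, 0.5), (float(mean_V) + sigma_V, 0.5)]

# (B.3) with a = 3: var(3V) should equal 9 var(V)
scaled = [(3 * x, p) for x, p in V]
print(mean_V, var_V, mean(W), var(W), var(scaled))
```

The exact fractions confirm the value 0.45 computed in the text, and $W$ has the same mean and variance as $V$ up to floating-point rounding.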


B.2.8 Important Discrete Random Variables 
Here are a few important examples. 


Bernoulli We say that $X$ is Bernoulli with parameter $p \in [0, 1]$, and we write 
$X =_D B(p)$, if 

$$X = \{(0, 1 - p), (1, p)\},$$

i.e., if 

$$P(X = 0) = 1 - p \text{ and } P(X = 1) = p.$$

You should check that $E(X) = p$ and $\mathrm{var}(X) = p(1 - p)$. This random variable 
models a coin flip where 1 represents "heads" and 0 "tails." 


Geometric We say that $X$ is geometrically distributed with parameter $p \in [0, 1]$, 
and we write $X =_D G(p)$, if 

$$P(X = n) = (1 - p)^{n-1} p, \quad n \ge 1.$$

You should check that $E(X) = 1/p$ and $\mathrm{var}(X) = (1 - p)p^{-2}$. This random 
variable models the number of coin flips until the first "heads" if the probability of 
heads is $p$ (Fig. B.5). (Sometimes, $X - 1$ is also called a geometric random variable 
on $\{0, 1, \ldots\}$. One avoids confusion by specifying the range. We will try to stick to 
our definition of $X$ on $\{1, 2, \ldots\}$.) 

² The symbol $=_D$ means equal in distribution. 

Fig. B.5 A geometric random variable models the number of coin flips until a 
first "heads" 

Fig. B.6 The probability mass function of the B(100, p) distribution, for p = 0.1, 0.2, and 0.5 


Binomial We say that $X$ is binomial with parameters $N$ and $p$, and we write $X =_D 
B(N, p)$, if 

$$P(X = n) = \binom{N}{n} p^n (1 - p)^{N-n}, \quad n = 0, \ldots, N, \quad \text{(B.4)}$$

where 

$$\binom{N}{n} = \frac{N!}{(N - n)! \, n!}.$$

You should verify that $E(X) = Np$ and $\mathrm{var}(X) = Np(1 - p)$. This random variable 
models the number of heads in $N$ coin flips; it is the sum of $N$ independent Bernoulli 
random variables with parameter $p$. Indeed, there are $\binom{N}{n}$ strings of $N$ symbols in 
$\{H, T\}$ with $n$ symbols $H$ and $N - n$ symbols $T$. The probability of each of these 
sequences is $p^n (1 - p)^{N-n}$ (Figs. B.6 and B.7). 

Fig. B.7 The binomial distribution as a sum of Bernoulli random variables. At each step, every 
steel ball moves to the left or to the right with equal probabilities, i.e., by $2X_n - 1$ where $X_n$ is 
Bernoulli 0.5. The position after $N$ steps is $Y = \sum_{n=1}^N (2X_n - 1) = 2B(N, 0.5) - N$. After 
$M$ balls, the stacks show approximately the values of $M \times P(Y = y)$ for integer $y$'s 

Fig. B.8 Poisson pmf, from Wikipedia 


Poisson We say that $X$ is Poisson with parameter $\lambda$, and we write $X =_D P(\lambda)$, if 

$$P(X = n) = \frac{\lambda^n}{n!} e^{-\lambda}, \quad n \ge 0. \quad \text{(B.5)}$$

You should verify that $E(X) = \lambda$ and $\mathrm{var}(X) = \lambda$. This random variable models 
the number of text messages that you receive in 1 day (Fig. B.8). 
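The means and variances quoted for these distributions can be checked numerically from the probability mass functions. The sketch below is only a sanity check, not a proof: the parameter values are arbitrary, and the infinite sums for the geometric and Poisson cases are truncated where the remaining terms are negligible.

```python
import math

def moments(pmf):
    """Return (mean, variance) of a pmf given as (value, probability) pairs."""
    m = sum(n * q for n, q in pmf)
    v = sum((n - m) ** 2 * q for n, q in pmf)
    return m, v

p, N, lam = 0.3, 20, 4.0  # arbitrary illustrative parameters

# Geometric G(p) on {1, 2, ...}, truncated at n = 399
geometric = [(n, (1 - p) ** (n - 1) * p) for n in range(1, 400)]
# Binomial B(N, p) on {0, ..., N}
binomial = [(n, math.comb(N, n) * p ** n * (1 - p) ** (N - n)) for n in range(N + 1)]
# Poisson P(lam) on {0, 1, ...}, truncated at n = 99
poisson = [(n, lam ** n / math.factorial(n) * math.exp(-lam)) for n in range(100)]

geo_mean, geo_var = moments(geometric)
bin_mean, bin_var = moments(binomial)
poi_mean, poi_var = moments(poisson)
print(geo_mean, geo_var, bin_mean, bin_var, poi_mean, poi_var)
```

The printed values should be close to $1/p$ and $(1-p)/p^2$, to $Np$ and $Np(1-p)$, and to $\lambda$ and $\lambda$, respectively.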


B.3 Multiple Discrete Random Variables 


Quite often one is interested in multiple random variables. These random variables 
may be related. For instance, the weight and height of a person, the voltage that a 
transmitter sends and the one that the receiver gets, and the backlog and delay at a 
queue are pairs of non-independent random variables (Fig. B.9). 

Fig. B.9 Height and weight are related 

Fig. B.10 The jpmf of a pair of discrete random variables. The pair of values $(x_i, y_j)$ 
has probability $p_{i,j}$ 


B.3.1 Joint Distribution 


To study such dependent random variables, one needs a description more complete 
than simply looking at the random variables individually. Consider the following 
example. Roll a die and let $X = 1$ if the outcome is odd and $X = 0$ otherwise. 
Let also $Y = 1$ if the outcome is in $\{2, 3, 4\}$ and $Y = 0$ if it is in $\{1, 5, 6\}$. Note 
that $P(X = 1) = P(X = 0) = 0.5$ and $P(Y = 1) = P(Y = 0) = 0.5$. 
Thus, individually, $X$ and $Y$ could describe the outcomes of flipping two fair coins. 
However, jointly, the pair $(X, Y)$ does not look like the outcomes of two coin flips. 
For instance, $X = 1$ and $Y = 1$ only if the outcome is 3, which has probability 1/6. 
If $X$ and $Y$ were the outcomes of two flips of a fair coin, one would have $X = 1$ and 
$Y = 1$ in one out of four equally likely outcomes. 

In the discrete case, one describes a pair $(X, Y)$ of random variables by listing 
the possible values and their probabilities (see Fig. B.10): 

$$p_{i,j} := P(X = x_i, Y = y_j), \quad \forall (i, j) \in \{1, \ldots, m\} \times \{1, \ldots, n\},$$

where the $p_{i,j}$ are nonnegative and add up to one. Here, $m$ and $n$ can be infinite. 
This description specifies the joint probability mass function (jpmf) of the random 
variables $(X, Y)$. 

From this description, one can in particular recover the probability mass function of $X$ 
and that of $Y$. For instance, 

$$P(X = x_i) = \sum_{j=1}^n P(X = x_i, Y = y_j) = \sum_{j=1}^n p_{i,j}.$$

B.3.2 Independence 
One says that X and Y are independent if 


$$P(X = x, Y = y) = P(X = x)P(Y = y), \quad \forall x, y.$$

In our die roll example, note that 

$$P(X = 1, Y = 1) = \frac{1}{6} \ne P(X = 1)P(Y = 1) = \frac{1}{4},$$


so that X and Y are not independent (Fig. B.11). 


B.3.3 Expectation of Function of Multiple RVs 


For $h : \Re^2 \to \Re$, one then defines 

$$E(h(X, Y)) = \sum_{i=1}^m \sum_{j=1}^n h(x_i, y_j) p_{i,j}.$$

Fig. B.11 Independence? 

Note that if $h(x, y) = h_1(x, y) + h_2(x, y)$, then 

$$\begin{aligned}
E(h(X, Y)) &= \sum_{i=1}^m \sum_{j=1}^n h(x_i, y_j) p_{i,j} \\
&= \sum_{i=1}^m \sum_{j=1}^n [h_1(x_i, y_j) + h_2(x_i, y_j)] p_{i,j} \\
&= \sum_{i=1}^m \sum_{j=1}^n h_1(x_i, y_j) p_{i,j} + \sum_{i=1}^m \sum_{j=1}^n h_2(x_i, y_j) p_{i,j} \\
&= E(h_1(X, Y)) + E(h_2(X, Y)),
\end{aligned}$$

so that expectation is linear. 


B.3.4 Covariance 
In particular, one defines the covariance of $X$ and $Y$ as 

$$\mathrm{cov}(X, Y) := E((X - E(X))(Y - E(Y))).$$

By linearity of expectation, one has 

$$\mathrm{cov}(X, Y) = E(XY - E(X)Y - XE(Y) + E(X)E(Y)) = E(XY) - E(X)E(Y).$$

One says that $X$ and $Y$ are uncorrelated if $\mathrm{cov}(X, Y) = 0$. One says that $X$ and 
$Y$ are positively correlated if $\mathrm{cov}(X, Y) > 0$ and that they are negatively correlated 
if $\mathrm{cov}(X, Y) < 0$ (Fig. B.12). 

Fig. B.12 These random variables are positively correlated: if one is large, the 
other one tends to be large as well. The dots represent equally likely pairs of values 

In the die roll example, one finds 

$$\mathrm{cov}(X, Y) = E(XY) - E(X)E(Y) = \frac{1}{6} - \frac{1}{4} < 0,$$

so that $X$ and $Y$ are negatively correlated. This negative correlation suggests that if 
$X$ is larger than average, then $Y$ tends to be smaller than average. In our example, 
we see that if $X = 1$, then the outcome is odd and $Y$ is more likely to be 0 than 1. 
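The die roll example can be worked out mechanically from the six outcomes. The sketch below is an illustration only (the dictionary-based representation is an assumption of this example): it builds the jpmf, recovers the marginal pmfs by summing, tests independence, and computes the covariance.

```python
from collections import defaultdict
from fractions import Fraction

# jpmf of (X, Y): joint[(x, y)] = P(X = x, Y = y)
joint = defaultdict(Fraction)
for outcome in range(1, 7):
    x = 1 if outcome % 2 == 1 else 0       # X = 1 if the outcome is odd
    y = 1 if outcome in (2, 3, 4) else 0   # Y = 1 if the outcome is in {2, 3, 4}
    joint[(x, y)] += Fraction(1, 6)

# Marginal pmfs, obtained by summing the jpmf over the other variable
p_x = defaultdict(Fraction)
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

independent = all(joint[(x, y)] == p_x[x] * p_y[y] for x in (0, 1) for y in (0, 1))

e_xy = sum(x * y * p for (x, y), p in joint.items())
e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
cov = e_xy - e_x * e_y
print(dict(joint), independent, cov)
```

The marginals are those of two fair coins, yet the joint probabilities do not factorize and the covariance is $1/6 - 1/4 = -1/12$.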
Here is an important result: 


Theorem B.4 (Independent Random Variables are Uncorrelated) 


(a) Independent random variables are uncorrelated. 

(b) The converse is not true. 

(c) The variance of a sum of uncorrelated random variables is the sum of their 
variances. 


Proof 

(a) Let $X$, $Y$ be independent. Then 

$$\begin{aligned}
E(XY) &= \sum_x \sum_y xy P(X = x, Y = y) = \sum_x \sum_y xy P(X = x)P(Y = y) \\
&= \Big(\sum_x x P(X = x)\Big) \Big(\sum_y y P(Y = y)\Big) = E(X)E(Y).
\end{aligned}$$

(b) As a simple example (see Fig. B.13), say that $(X, Y)$ is equally likely to take each 
of the following four values: 

$$\{(-1, 0), (0, 1), (0, -1), (1, 0)\}.$$

Then one sees that $E(XY) = 0 = E(X)E(Y)$ so that $X$ and $Y$ are uncorrelated. 
However, $P(X = -1, Y = 1) = 0 \ne P(X = -1)P(Y = 1)$, so that $X$ and $Y$ 
are not independent. 

(c) Let $X$ and $Y$ be uncorrelated random variables. Then 

$$\begin{aligned}
\mathrm{var}(X + Y) &= E((X + Y - E(X + Y))^2) \\
&= E(X^2 + Y^2 + 2XY) - E(X)^2 - E(Y)^2 - 2E(X)E(Y) \\
&= E(X^2) - E(X)^2 + E(Y^2) - E(Y)^2 \\
&= \mathrm{var}(X) + \mathrm{var}(Y).
\end{aligned}$$

The third equality in this derivation comes from the fact that $E(XY) = E(X)E(Y)$. $\square$ 

Fig. B.13 The random variables $X$ and $Y$ are uncorrelated but not independent 

B.3.5 Conditional Expectation 


Consider a pair $(X, Y)$ of discrete random variables such that $P(X = x_i, Y = y_j) = 
p_{i,j}$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. In particular, $P(X = x_i) = \sum_k P(X = 
x_i, Y = y_k) = \sum_k p_{i,k}$. Using the definition of conditional probability, we have 

$$P[Y = y_j \mid X = x_i] = \frac{P(X = x_i, Y = y_j)}{P(X = x_i)}.$$

Thus, $P[Y = y_j \mid X = x_i]$ for $j = 1, \ldots, n$ is the conditional distribution, or 
conditional pmf, of $Y$ given that $X = x_i$. 

In particular, note that if $X$ and $Y$ are independent, then $P[Y = y_j \mid X = x_i] = 
P(Y = y_j)$. 

We define $E[Y \mid X = x_i]$, the conditional expectation of $Y$ given $X = x_i$, as 
follows: 

$$E[Y \mid X = x_i] = \sum_j y_j P[Y = y_j \mid X = x_i].$$

We then define $E[Y \mid X]$ to be a new random variable that is equal to $E[Y \mid X = 
x_i]$ when $X = x_i$. That is, $E[Y \mid X]$ is a function $g(X)$ of $X$ with $g(x_i) = E[Y \mid 
X = x_i]$. 

The interpretation is that we observe $X = x_i$, which tells us that $Y$ now has a new 
distribution: its conditional distribution given that $X = x_i$. Then $E[Y \mid X = x_i]$ is 
the expected value of $Y$ for this conditional distribution. 


Theorem B.5 (Properties of Conditional Expectation) One has 

$$E(E[Y \mid X]) = E(Y) \quad \text{(B.6)}$$
$$E[h(X)Y \mid X] = h(X)E[Y \mid X] \quad \text{(B.7)}$$
$$E[Y \mid X] = E(Y), \text{ if } X \text{ and } Y \text{ are independent.} \quad \text{(B.8)}$$

Proof To verify (B.6), one notes that 

$$\begin{aligned}
E(E[Y \mid X]) &= \sum_i P(X = x_i) E[Y \mid X = x_i] = \sum_i P(X = x_i) \sum_j y_j P[Y = y_j \mid X = x_i] \\
&= \sum_i \sum_j y_j P(X = x_i) P[Y = y_j \mid X = x_i] \\
&= \sum_i \sum_j y_j P(X = x_i, Y = y_j) = \sum_j y_j P(Y = y_j) = E(Y).
\end{aligned}$$

For (B.7), we recall that, by definition, $E[h(X)Y \mid X]$ is a random variable that 
takes the value $E[h(X)Y \mid X = x_i]$ when $X = x_i$. Also, $E[h(X)Y \mid X = x_i]$ is the 
expected value of $h(X)Y$ given that $X = x_i$, i.e., of $h(x_i)Y$ given that $X = x_i$. By 
linearity of expectation, this is $h(x_i)E[Y \mid X = x_i]$. 

Finally, (B.8) is immediate since the distribution of $Y$ given $X = x_i$ is the original 
distribution of $Y$ when $X$ and $Y$ are independent. $\square$ 


B.3.6 Conditional Expectation of a Function 


In the same spirit as Theorem B.3, one has the following result: 


Theorem B.6 (Conditional Expectation of a Function of a Random Variable) 
One has 


$$E[h(Y) \mid X = x_i] = \sum_j h(y_j) P[Y = y_j \mid X = x_i].$$

Also, conditional expectation is linear: 

$$E[h_1(Y) + h_2(Y) \mid X] = E[h_1(Y) \mid X] + E[h_2(Y) \mid X].$$
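As an illustration of these definitions, the following sketch computes $E[Y \mid X = x]$ for the die roll example of Sect. B.3.1 and checks property (B.6). The helper names `marginal_x` and `cond_exp_y` are hypothetical, chosen for this example only.

```python
from fractions import Fraction

# jpmf of the die roll example: joint[(x, y)] = P(X = x, Y = y)
joint = {
    (1, 0): Fraction(2, 6),  # outcomes 1 and 5
    (0, 1): Fraction(2, 6),  # outcomes 2 and 4
    (1, 1): Fraction(1, 6),  # outcome 3
    (0, 0): Fraction(1, 6),  # outcome 6
}

def marginal_x(x):
    """P(X = x), obtained by summing the jpmf over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_exp_y(x, h=lambda y: y):
    """E[h(Y) | X = x] = sum_j h(y_j) P[Y = y_j | X = x]."""
    return sum(h(y) * p for (xi, y), p in joint.items() if xi == x) / marginal_x(x)

x_values = sorted({x for x, _ in joint})
g = {x: cond_exp_y(x) for x in x_values}           # E[Y | X] as a function g of X
tower = sum(marginal_x(x) * g[x] for x in x_values)  # E(E[Y | X])
e_y = sum(y * p for (_, y), p in joint.items())      # E(Y)
print(g, tower, e_y)
```

Here $g(1) = 1/3$ and $g(0) = 2/3$, and averaging $g(X)$ over the distribution of $X$ gives back $E(Y) = 1/2$.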


B.4 General Random Variables 


Not all random variables have a discrete set of possible values. For instance, the 
voltage across a phone line, wind speed, temperature, and the time until the next 
customer arrives at a cashier have a continuous range of possible values. 

In practice, one can always approximate values by choosing a finite number 
of bits to represent them. For instance, one can measure temperature in degrees, 
ignoring fractions, and fixing a lower and upper bound. Thus, discrete random 
variables suffice to describe systems with an arbitrary degree of precision. However, 
this discretization is rather artificial and complicates things. For instance, writing 
Newton's equation F = ma where a = dv(t)/dt with discrete variables is 
rather bizarre since a discrete speed does not admit a derivative. Hence, although 
computers perform all their calculations on discrete variables, the analysis and 
derivation of algorithms are often more natural with general variables. Nevertheless, 
the approximation intuition is useful and we make use of it. 
We start with a definition of a general random variable. 
B.4.1 Definitions 
Definition B.1 (cdf and pdf) Let X be a random variable. 


(a) The cumulative distribution function (cdf) of $X$ is the function $F_X(x)$ defined 
by 

$$F_X(x) = P(X \le x), \quad x \in \Re.$$

(b) The probability density function (pdf) of $X$ is 

$$f_X(x) = \frac{d}{dx} F_X(x),$$

if this derivative exists. 

Observe that, for $a < b$, 

$$P(a < X \le b) = P(X \le b) - P(X \le a) = F_X(b) - F_X(a) = \int_a^b f_X(x) dx,$$

where the last expression makes sense if the derivative exists. Also, if the pdf exists, 

$$f_X(x) dx = F_X(x + dx) - F_X(x) = P(X \in (x, x + dx]).$$


This identity explains the term "probability density." 


B.4.2 Examples 


Example B.1 ($U[a, b]$) As a first example, we say that $X$ is uniformly distributed 
in $[a, b]$, for some $a < b$, and we write $X =_D U[a, b]$ if 

$$f_X(x) = \frac{1}{b - a} 1\{a \le x \le b\}.$$

In this case, we see that 

$$F_X(x) = \max\left\{0, \min\left\{1, \frac{x - a}{b - a}\right\}\right\}.$$

Figure B.14 illustrates the pdf and the cdf of a $U[a, b]$ random variable. 

Fig. B.14 The pdf and cdf of a $U[a, b]$ random variable 

Example B.2 ($Exp(\lambda)$) As a second example, we say that $X$ is exponentially 
distributed with rate $\lambda > 0$, and we write $X =_D Exp(\lambda)$, if 

$$f_X(x) = \lambda e^{-\lambda x} 1\{x \ge 0\}$$

(see Fig. B.15). 

Fig. B.15 Density of exponential distribution 

As before, you can verify that 

$$F_X(x) = 1 - \exp\{-\lambda x\}, \text{ for } x \ge 0,$$

so that 

$$P(X > x) = \exp\{-\lambda x\}, \quad \forall x \ge 0.$$

It may help intuition to realize that a random variable $X$ with cdf $F_X(\cdot)$ 
can be approximated by a discrete random variable $Y$ that takes values in 
$\{\ldots, -2\epsilon, -\epsilon, 0, \epsilon, 2\epsilon, \ldots\}$ with 

$$P(Y = n\epsilon) = F_X((n + 1)\epsilon) - F_X(n\epsilon) = P(X \in (n\epsilon, (n + 1)\epsilon])$$

(see Fig. B.16). 

Fig. B.16 A discrete approximation of a continuous random variable 
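To make the discrete approximation concrete, the sketch below builds $Y$ for an exponentially distributed $X$ and compares $E(Y)$ with $1/\lambda$. The rate, the grid size, and the truncation point are arbitrary choices made for this illustration.

```python
import math

lam = 2.0      # rate of X (illustrative value)
eps = 1e-3     # grid size of the discrete approximation
n_max = 10_000  # truncate the grid at n_max * eps = 10, where the tail is negligible

def cdf(x):
    """cdf of Exp(lam)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

# P(Y = n eps) = F_X((n + 1) eps) - F_X(n eps)
pmf = [(n * eps, cdf((n + 1) * eps) - cdf(n * eps)) for n in range(n_max)]

total = sum(q for _, q in pmf)
mean_Y = sum(y * q for y, q in pmf)
print(total, mean_Y, 1 / lam)
```

Since $Y$ rounds $X$ down to the grid, $E(Y)$ is slightly smaller than $1/\lambda$, and the gap is at most $\epsilon$.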


B.4.3 Expectation 


For a function $h : \Re \to \Re$, one has 

$$E(h(Y)) = \sum_{n=-\infty}^{\infty} h(n\epsilon) [F_X((n + 1)\epsilon) - F_X(n\epsilon)] \approx \int_{-\infty}^{\infty} h(x) dF_X(x),$$

where the last term is defined as the limit of the sum as $\epsilon \to 0$. If the pdf exists, one 
sees that 

$$\int_{-\infty}^{\infty} h(x) dF_X(x) = \int_{-\infty}^{\infty} h(x) f_X(x) dx.$$

If $\epsilon$ is very small, the approximation of $X$ by $Y$ is very close, so that the expressions 
for $E(h(Y))$ should approach $E(h(X))$. We state these observations as a theorem. 


Theorem B.7 (Expectation of a Function of a Random Variable) Let $X$ be a 
random variable with cdf $F_X(\cdot)$ and $h : \Re \to \Re$ some function. Then 

$$E(h(X)) = \int_{-\infty}^{\infty} h(x) dF_X(x).$$

If the pdf $f_X(\cdot)$ of $X$ exists, then 

$$E(h(X)) = \int_{-\infty}^{\infty} h(x) f_X(x) dx.$$


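For example, if $X =_D U[0, 1]$, then 

$$E(X^k) = \int_0^1 x^k dx = \frac{1}{k + 1}.$$

In particular, 

$$\mathrm{var}(X) = E(X^2) - E(X)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$

As another example, if $X =_D Exp(\lambda)$, then 

$$\begin{aligned}
E(X) &= \int_0^\infty x \lambda e^{-\lambda x} dx = -\int_0^\infty x \, de^{-\lambda x} = -[x e^{-\lambda x}]_0^\infty + \int_0^\infty e^{-\lambda x} dx \\
&= -\lambda^{-1} [e^{-\lambda x}]_0^\infty = \lambda^{-1}.
\end{aligned}$$

Also, 

$$E(X^2) = \int_0^\infty x^2 \lambda e^{-\lambda x} dx = 2\lambda^{-2}.$$

In particular, 

$$\mathrm{var}(X) = E(X^2) - E(X)^2 = 2\lambda^{-2} - (\lambda^{-1})^2 = \lambda^{-2}.$$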


As a generally confusing example, consider the random variable that is equal 
to 0.3 with probability 0.4 and is uniformly distributed in [0, 1] with probability 
0.6. That is, one flips a biased coin that yields "head" with probability 0.4. If the 
outcome of the coin flip is head, then $X = 0.3$. If the outcome is tail, then $X$ is 
picked uniformly in [0, 1]. Then, 

$$F_X(x) = P(X \le x) = 0.4 \times 1\{x \ge 0.3\} + 0.6x, \quad x \in [0, 1].$$

This cdf is illustrated in Fig. B.17. We can define the derivative of $F_X(x)$ formally 
by using the Dirac impulse as the formal derivative of a step function. 

Fig. B.17 The pdf and cdf of the mixed random variable $X$. The pdf is equal to 0.6 on [0, 1] 
plus the impulse $0.4\delta(x - 0.3)$ 

For this random variable, one finds that³ 

$$\begin{aligned}
E(X^k) &= \int_{-\infty}^{\infty} x^k f_X(x) dx \\
&= \int_{-\infty}^{\infty} x^k 0.4 \delta(x - 0.3) dx + \int_0^1 x^k 0.6 \, dx \\
&= 0.4 (0.3)^k + \frac{0.6}{k + 1}.
\end{aligned}$$

In particular, we find that 

$$\begin{aligned}
\mathrm{var}(X) &= E(X^2) - E(X)^2 \\
&= 0.4 (0.3)^2 + 0.6 \frac{1}{3} - \left[0.4 (0.3) + 0.6 \frac{1}{2}\right]^2 = 0.0596.
\end{aligned}$$

³ Recall that the Dirac impulse is defined by 

$$\int_{-\infty}^{\infty} g(x) \delta(x - a) dx = g(a)$$

for any function $g(\cdot)$ that is continuous at $a$. 

Fig. B.18 The random variables $X_n$ converge to zero but their expectation does not 


B.4.4 Continuity of Expectation 


We state without proof two useful technical properties of expectation. They address 
the following question. Assume that $X_n \to X$ as $n \to \infty$. Can we conclude that 
$E(X_n) \to E(X)$? In other words, is expectation "continuous"? 

The following counterexample shows that some conditions are needed (see 
Fig. B.18). Say that $\omega$ is chosen uniformly in [0, 1], so that $P([0, a]) = a$ for 
$a \in \Omega := (0, 1]$. Define $X_n(\omega) = n \times 1\{\omega \le 1/n\}$ for $n \ge 1$. That is, 
$X_n(\omega) = n$ if $\omega \le 1/n$ and $X_n(\omega) = 0$ otherwise. Then $P(X_n = n) = 1/n$ 
and $P(X_n = 0) = 1 - 1/n$, so that $E(X_n) = 1$ for all $n$. Also, $X_n(\omega) \to 0$ as 
$n \to \infty$, for all $\omega \in \Omega$. Indeed, $X_n(\omega) = 0$ for all $n > 1/\omega$. Thus, $X_n \to X = 0$ 
but $E(X_n) = 1$ does not converge to $0 = E(X)$. 


Theorem B.8 (Dominated Convergence Theorem (DCT)) Assume that 
$|X_n(\omega)| \le Y(\omega)$ for all $\omega \in \Omega$ where $E(Y) < \infty$. Assume also that, for all 
$\omega \in \Omega$, $X_n(\omega) \to X(\omega)$ as $n \to \infty$. Then $E(X_n) \to E(X)$ as $n \to \infty$. 

Theorem B.9 (Monotone Convergence Theorem (MCT)) Assume that $0 \le 
X_n(\omega) \le X_{n+1}(\omega)$ for all $\omega$ and $n = 1, 2, \ldots$. Assume also that $X_n(\omega) \to X(\omega)$ as 
$n \to \infty$ for all $\omega \in \Omega$. Then $E(X_n) \to E(X)$ as $n \to \infty$. 


One also has the following useful fact. 


Theorem B.10 (Expectation as Integral of Complementary cdf) Let $X \ge 0$ be 
a nonnegative random variable with $E(X) < \infty$. Then 

$$E(X) = \int_0^\infty P(X > x) dx.$$

Proof Recall the integration by parts formula: 

$$\int_a^b u(x) dv(x) = [u(x)v(x)]_a^b - \int_a^b v(x) du(x)$$

that follows from the fact that 

$$\frac{d}{dx} [u(x)v(x)] = u'(x)v(x) + u(x)v'(x).$$

Using that formula, one finds 

$$\begin{aligned}
E(X) &= \int_0^\infty x \, dF_X(x) = -\int_0^\infty x \, d(1 - F_X(x)) \\
&= -[x(1 - F_X(x))]_0^\infty + \int_0^\infty (1 - F_X(x)) dx = \int_0^\infty P(X > x) dx.
\end{aligned}$$

For the last equality we use the fact that $x(1 - F_X(x)) = xP(X > x)$ goes to 
zero as $x \to \infty$. This fact follows from DCT. To see this, define $X_n = n 1\{X > n\}$. 
Then $|X_n| \le X$ for all $n$. Also, $X_n \to 0$ as $n \to \infty$. Since $E(X) < \infty$, DCT then 
implies that $nP(X > n) = E(X_n) \to 0$. $\square$ 


The function $P(X > x) = 1 - F_X(x)$ is called the complementary cdf. 
For instance, if $X =_D Exp(\lambda)$, then 

$$E(X) = \int_0^\infty P(X > x) dx = \int_0^\infty \exp\{-\lambda x\} dx = \frac{1}{\lambda}.$$

As another example, if $X =_D G(p)$, then $P(X > x) = (1 - p)^n$ for $x \in [n, n + 1)$ 
and 

$$E(X) = \int_0^\infty P(X > x) dx = \sum_{n \ge 0} (1 - p)^n = \frac{1}{p}.$$

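Both examples can be checked numerically. The sketch below is only a rough verification under arbitrary parameter choices: it sums the complementary cdf of a geometric random variable and integrates that of an exponential random variable with a midpoint rule, truncating where the tails are negligible.

```python
import math

p, lam = 0.25, 2.0  # illustrative parameters

# Geometric: the integral of P(X > x) is the sum over n >= 0 of (1 - p)^n
mean_geometric = sum((1 - p) ** n for n in range(2000))

# Exponential: midpoint rule for the integral of exp(-lam x) over [0, 20]
dx = 1e-4
mean_exponential = sum(math.exp(-lam * (k + 0.5) * dx) * dx for k in range(200_000))

print(mean_geometric, 1 / p, mean_exponential, 1 / lam)
```

The two sums should be close to $1/p = 4$ and $1/\lambda = 0.5$.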

B.5 Multiple Random Variables 
B.5.1 Random Vector 
A random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is a vector whose components are random 
variables defined on the same probability space. That is, it is a function $\mathbf{X} : \Omega \to 
\Re^n$. One then defines the joint cumulative distribution function (jcdf) $F_{\mathbf{X}}$ as follows: 

$$F_{\mathbf{X}}(\mathbf{x}) = P(X_1 \le x_1, \ldots, X_n \le x_n), \quad \mathbf{x} \in \Re^n.$$

The derivative of this function, if it exists, is the joint probability density function 
(jpdf) $f_{\mathbf{X}}(\mathbf{x})$. That is, 

$$F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{\mathbf{X}}(\mathbf{u}) du_1 \cdots du_n.$$

The interpretation of the jpdf is that 

$$f_{\mathbf{X}}(\mathbf{x}) dx_1 \cdots dx_n = P(X_m \in (x_m, x_m + dx_m) \text{ for } m = 1, \ldots, n).$$

For instance, let 

$$f_{X,Y}(x, y) = \frac{1}{\pi} 1\{x^2 + y^2 \le 1\}, \quad x, y \in \Re.$$

Then, we say that $(X, Y)$ is picked uniformly at random inside the unit circle. 

One intuitive way to look at these random variables is to approximate them by 
points on a fine grid with size $\epsilon > 0$. For instance, an $\epsilon$-approximation of a pair 
$(X, Y)$ is $(\tilde{X}, \tilde{Y})$ defined by 

$$(\tilde{X}, \tilde{Y}) = (m\epsilon, n\epsilon) \text{ w.p. } f_{X,Y}(m\epsilon, n\epsilon) \epsilon^2.$$

This approximation suggests that 

$$\begin{aligned}
E(h(\tilde{X}, \tilde{Y})) &= \sum_{m,n} h(m\epsilon, n\epsilon) f_{X,Y}(m\epsilon, n\epsilon) \epsilon^2 \\
&\approx \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) f_{X,Y}(x, y) dx dy.
\end{aligned}$$

We take this as a definition. 


Definition B.2 Let $(X, Y)$ be a pair of random variables and $h : \Re^2 \to \Re$. If the 
jpdf exists, then 

$$E(h(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) f_{X,Y}(x, y) dx dy.$$

More generally, 

$$E(h(\mathbf{X})) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(\mathbf{x}) f_{\mathbf{X}}(\mathbf{x}) dx_1 \cdots dx_n.$$

This definition guarantees that expectation is linear. The covariance of $X$ and $Y$ 
is defined as before. 

Definition B.3 (Independence) Two random variables $X$ and $Y$ are independent if 

$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$

for all sets $A$ and $B$ in $\Re$. 

It is a simple exercise to show that, if the jpdf exists, the random variables are 
independent if and only if 

$$f_{X,Y}(x, y) = f_X(x) f_Y(y), \quad \forall x, y \in \Re.$$


If $X$ is a random variable and $g : \Re \to \Re$ is some function, then $g(X)$ is a 
random variable. Note that 

$$g(X) \in A \text{ if and only if } X \in g^{-1}(A) := \{x \in \Re \mid g(x) \in A\}.$$

Of course, this is a tautology. 
Here is a very useful observation. 

Theorem B.11 (Functions of Independent Random Variables are Independent) 
Let $X$, $Y$ be two independent random variables and $g, h : \Re \to \Re$ be two functions. 
Then $g(X)$ and $h(Y)$ are two independent random variables. 

Proof Note that 

$$\begin{aligned}
P(g(X) \in A, h(Y) \in B) &= P(X \in g^{-1}(A), Y \in h^{-1}(B)) \\
&= P(X \in g^{-1}(A)) P(Y \in h^{-1}(B)) \\
&= P(g(X) \in A) P(h(Y) \in B). \quad \square
\end{aligned}$$


B.5.2 Minimum and Maximum of Independent RVs 
One is often led to considering the minimum or the maximum of independent 
random variables. The basic observation is as follows. Let X, Y be independent 


random variables. Let $V = \min\{X, Y\}$ and $W = \max\{X, Y\}$. Then, 

$$P(V > v) = P(X > v, Y > v) = P(X > v)P(Y > v).$$

Also, 

$$P(W \le w) = P(X \le w, Y \le w) = P(X \le w)P(Y \le w).$$

These observations often suffice to do useful calculations. 
For example, assume that $X =_D Exp(\lambda)$ and $Y =_D Exp(\mu)$. Then 

$$P(V > v) = P(X > v)P(Y > v) = \exp\{-\lambda v\} \exp\{-\mu v\} = \exp\{-(\lambda + \mu)v\}.$$

Thus, the minimum of two independent exponentially distributed random variables is 
exponentially distributed, with a rate equal to the sum of the rates. 

Let $X$, $Y$ be i.i.d. $U[0, 1]$. Then, 

$$P(W \le w) = P(X \le w)P(Y \le w) = w^2, \text{ for } w \in [0, 1].$$
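A short simulation gives a feel for both results. This is a Monte Carlo sketch with arbitrary rates, seed, and sample size, so the estimates are only approximately equal to the exact values.

```python
import random

rng = random.Random(1)
lam, mu, n = 1.0, 3.0, 200_000  # illustrative rates and sample size

# Minimum of independent Exp(lam) and Exp(mu): should behave like Exp(lam + mu)
mins = [min(rng.expovariate(lam), rng.expovariate(mu)) for _ in range(n)]
mean_min = sum(mins) / n                      # close to 1 / (lam + mu)
tail_min = sum(v > 0.5 for v in mins) / n     # close to exp(-(lam + mu) * 0.5)

# Maximum of two i.i.d. U[0, 1]: P(W <= w) should be close to w^2
w = 0.5
frac_max = sum(max(rng.random(), rng.random()) <= w for _ in range(n)) / n
print(mean_min, tail_min, frac_max)
```

With these parameters the three estimates should be near 0.25, $e^{-2} \approx 0.135$, and 0.25.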


B.5.3 Sum of Independent Random Variables 


Let $X$, $Y$ be independent random variables and let $Z = X + Y$. We want to calculate 
$f_Z(z)$ from $f_X(x)$ and $f_Y(y)$. The idea is that 

$$P(Z \in (z, z + dz)) = \int_{-\infty}^{+\infty} P(X \in (x, x + dx), Y \in (z - x, z - x + dz)).$$

Hence, 

$$f_Z(z) dz = \int_{-\infty}^{+\infty} f_X(x) f_Y(z - x) dx \, dz.$$

We conclude that 

$$f_Z(z) = \int_{-\infty}^{+\infty} f_X(x) f_Y(z - x) dx = f_X * f_Y(z),$$

where $g * h$ indicates the convolution of two functions. If you took a class on signals 
and systems, you learned the "flip and drag" graphical method to find a convolution. 
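As an illustration, the convolution integral can be evaluated numerically for two independent $U[0, 1]$ random variables, whose sum has the triangular density $f_Z(z) = z$ on $[0, 1]$ and $2 - z$ on $[1, 2]$. The sketch below uses a plain midpoint rule; the integration range and step count are arbitrary choices.

```python
def f_uniform(x):
    """pdf of U[0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def density_of_sum(z, f_x, f_y, lo=-1.0, hi=2.0, steps=3000):
    """Midpoint-rule approximation of the integral of f_x(x) f_y(z - x) dx over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += f_x(x) * f_y(z - x)
    return total * dx

values = {z: density_of_sum(z, f_uniform, f_uniform) for z in (0.5, 1.0, 1.5)}
print(values)
```

The three values should be close to 0.5, 1.0, and 0.5, as predicted by the triangular density.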
B.6 Random Vectors 

In many situations, one is interested in a collection of random variables. 

Definition B.4 (Random Vector) A random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is a vector 
whose components are random variables. It is characterized by the Joint Cumulative 
Probability Distribution Function (jcdf) 

$$F_{\mathbf{X}}(x_1, \ldots, x_n) := P(X_1 \le x_1, \ldots, X_n \le x_n), \quad x_i \in \Re, \ i = 1, \ldots, n.$$

The Joint Probability Density Function (jpdf) is the function $f_{\mathbf{X}}(\mathbf{x})$ such that 

$$F_{\mathbf{X}}(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{\mathbf{X}}(u_1, \ldots, u_n) du_1 \ldots du_n,$$

if such a function exists. In that case, 

$$f_{\mathbf{X}}(\mathbf{x}) dx_1 \ldots dx_n = P(X_i \in [x_i, x_i + dx_i], \ i = 1, \ldots, n).$$

Thus, the jcdf and the jpdf specify the likelihood that the random vector takes 
values in given subsets of $\Re^n$. 
As in the case of two random variables, one has 

$$E(h(\mathbf{X})) = \int \cdots \int h(\mathbf{x}) f_{\mathbf{X}}(\mathbf{x}) dx_1 \ldots dx_n,$$

if the jpdf exists. 
The following definitions are used frequently. 


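Definition B.5 (Mean and Covariance) Let $\mathbf{X}$, $\mathbf{Y}$ be random vectors. One defines 

$$E(\mathbf{X}) = (E(X_1), \ldots, E(X_n))'$$
$$\Sigma_{\mathbf{X}} = E((\mathbf{X} - E(\mathbf{X}))(\mathbf{X} - E(\mathbf{X}))')$$
$$\mathrm{cov}(\mathbf{X}, \mathbf{Y}) = E((\mathbf{X} - E(\mathbf{X}))(\mathbf{Y} - E(\mathbf{Y}))').$$

We say that $\mathbf{X}$ and $\mathbf{Y}$ are uncorrelated if $\mathrm{cov}(\mathbf{X}, \mathbf{Y}) = 0$, i.e., if $X_i$ and $Y_j$ are 
uncorrelated for all $i, j$. 

Thus, the mean value of a vector is the vector of mean values. Similarly, the mean 
value of a matrix is defined as the matrix of mean values. Also, the covariance of $\mathbf{X}$ 
and $\mathbf{Y}$ is the matrix of covariances. Indeed, 

$$\mathrm{cov}(\mathbf{X}, \mathbf{Y})_{i,j} = E((X_i - E(X_i))(Y_j - E(Y_j))) = \mathrm{cov}(X_i, Y_j).$$

Note also that $\Sigma_{\mathbf{X}} = \mathrm{cov}(\mathbf{X}, \mathbf{X}) =: \mathrm{cov}(\mathbf{X})$. 
As a simple exercise, note that 

$$\mathrm{cov}(A\mathbf{X} + \mathbf{a}, B\mathbf{Y} + \mathbf{b}) = A \, \mathrm{cov}(\mathbf{X}, \mathbf{Y}) B'.$$

Fig. B.19 A geometric view of orthogonality 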


B.6.1 Orthogonality and Projection 
The notions of orthogonality and of projection are essential when studying estima- 
Bs X and Y be two random vectors. We say that X and Y are orthogonal and we 
write X L Y if 

E(XY’) — 0. 


Thus, X and Y are orthogonal if and only if each $X_i$ is orthogonal to every $Y_j$.
Note that if $E(X) = 0$, then $X \perp Y$ if and only if $\mathrm{cov}(X, Y) = 0$. Indeed,

$$\mathrm{cov}(X, Y) = E(XY') - E(X)E(Y)' = E(XY'),$$

since $E(X) = 0$.
The following fact is very useful (see Fig. B.19). 


Theorem B.12 (Orthogonality) If $X \perp Y$, then

$$E(\|Y - X\|^2) = E(\|X\|^2) + E(\|Y\|^2).$$

This statement is the equivalent of Pythagoras’ theorem.

Proof One has

$$\begin{aligned}
E(\|Y - X\|^2) &= E((Y - X)'(Y - X)) = E(Y'Y) - 2E(X'Y) + E(X'X) \\
&= E(\|Y\|^2) - 2E(X'Y) + E(\|X\|^2).
\end{aligned}$$

Now, if $X \perp Y$, then $E(X_i Y_j) = 0$ for all i, j. Consequently,
$E(X'Y) = \sum_i E(X_i Y_i) = 0$. This proves the result. ■
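
As a small sanity check of the theorem, one can compute both sides exactly on a finite probability space. In the sketch below, the four equally likely outcomes and the values of X and Y are assumptions chosen so that E(XY') = 0; the helper `E` is introduced for this example only.

```python
import numpy as np

# Four equally likely outcomes; column k holds the value of the vector at outcome k.
# These particular values are an illustrative assumption.
X = np.array([[1.0,  1.0, -1.0, -1.0],
              [1.0, -1.0,  1.0, -1.0]])
Y = np.array([[1.0, -1.0, -1.0,  1.0],
              [3.0,  3.0,  3.0,  3.0]])

def E(values):
    """Expectation under the uniform distribution on the outcomes (last axis)."""
    return values.mean(axis=-1)

E_XYt = X @ Y.T / X.shape[1]              # the matrix E(XY')
lhs = E(np.sum((Y - X) ** 2, axis=0))     # E(||Y - X||^2)
rhs = E(np.sum(X ** 2, axis=0)) + E(np.sum(Y ** 2, axis=0))
print(E_XYt)       # all entries are 0, so X and Y are orthogonal
print(lhs, rhs)    # 12.0 12.0
```

In this example Y does not have zero mean. Orthogonality still holds because E(X) = 0 and cov(X, Y) = 0.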
B.7 Density of a Function of Random Variables 
Assume that X has a known p.d.f. $f_X(x)$ on $\Re^n$ and that $g : \Re^n \to \Re^n$ is a
differentiable function. Let $Y = g(X)$. How do we find $f_Y(y)$?

We start with the linear case and then explain the general case. 
B.7.1 Linear Transformations 
Assume that X has p.d.f. $f_X(x)$. Let $Y = aX + b$ for some $a > 0$. How do we
calculate $f_Y(y)$?

As we see in Fig. B.20, we have

$$\begin{aligned}
P(Y \in (y, y + dy)) &= P(aX + b \in (y, y + dy)) \\
&= P(X \in (a^{-1}(y - b), a^{-1}(y + dy - b))).
\end{aligned}$$

Recall that $P(Z \in (z, z + dz)) = f_Z(z)dz$. Accordingly,

$$f_Y(y)dy = f_X(a^{-1}(y - b)) \times a^{-1} dy,$$

so that

$$f_Y(y) = \frac{1}{a} f_X(x) \text{ where } ax + b = y. \tag{B.9}$$

The case $a < 0$ is not that different. Repeating the argument above, one finds

$$f_Y(y) = \frac{1}{|a|} f_X(x) \text{ where } ax + b = y.$$
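
The scalar formula is easy to turn into code. The following sketch assumes, purely for illustration, that X is exponentially distributed with rate 1 and that Y = -2X + 3; the function name `density_of_affine` is hypothetical. It checks that the resulting f_Y integrates to 1 and evaluates it at one point.

```python
import numpy as np

def density_of_affine(f_X, a, b):
    """Return f_Y for Y = aX + b (a != 0), using f_Y(y) = f_X((y - b) / a) / |a|."""
    if a == 0:
        raise ValueError("a must be nonzero")
    return lambda y: f_X((y - b) / a) / abs(a)

def f_X(x):
    """Assumed example density: exponential with rate 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.exp(-np.abs(x)), 0.0)

f_Y = density_of_affine(f_X, a=-2.0, b=3.0)   # Y = -2X + 3 takes values in (-inf, 3]

y = np.linspace(-37.0, 3.0, 400001)           # grid covering essentially all the mass
values = f_Y(y)
total = np.sum((values[1:] + values[:-1]) / 2) * (y[1] - y[0])   # trapezoidal rule
print(total)       # close to 1, as a density should integrate to 1
print(f_Y(1.0))    # equals exp(-1) / 2, since x = (1 - 3) / (-2) = 1
```

The absolute value of a is what makes the same formula cover both a > 0 and a < 0.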


What about a pair of random variables? Assume that X is a random vector that
takes values in $\Re^2$ with p.d.f. $f_X(x)$. Let

$$Y = AX + b,$$

where $A \in \Re^{2 \times 2}$ and $b \in \Re^2$.

Fig. B.20 The linear transformation Y = aX + b

Fig. B.21 The linear transformation Y = AX + b

Figure B.21 shows that, under the linear transformation, the rectangle
$[x_1, x_1 + dx_1] \times [x_2, x_2 + dx_2]$ gets mapped into a parallelogram with area
$|A|dx_1dx_2$ where $|A|$ is the absolute value of the determinant of the matrix A. Hence,
the probability that Y falls in this parallelogram with area $|A|dx_1dx_2$ is $f_X(x)dx_1dx_2$.
Since the probability that Y takes value in a small area is proportional to that area, the
probability that Y falls in this parallelogram with area $|A|dx_1dx_2$ is also given by
$f_Y(y)|A|dx_1dx_2$ where $y = Ax + b$. Thus,

$$f_Y(y)|A|dx_1dx_2 = f_X(x)dx_1dx_2, \text{ with } y = Ax + b.$$

Hence,

$$f_Y(y) = \frac{1}{|A|} f_X(x) \text{ where } Ax + b = y.$$


In fact, this result holds for n random variables. 
Given the importance of this result, we state it as a theorem. 

Theorem B.13 (Change of Density Through Linear Transformation) Let $Y = AX + b$
where A is an n × n nonsingular matrix. Then

$$f_Y(y) = \frac{1}{|A|} f_X(x) \text{ where } Ax + b = y. \tag{B.10}$$
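
Here is a minimal numerical illustration of (B.10). It assumes that X consists of two i.i.d. N(0, 1) random variables and uses the standard fact that Y = AX + b is then Gaussian with mean b and covariance AA'. The matrix A, the vector b, the point y, and the function names are all choices made for this example.

```python
import numpy as np

def f_X(x):
    """Density of two i.i.d. N(0, 1) random variables at the point x."""
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def f_Y_from_theorem(y, A, b):
    """f_Y(y) = f_X(x) / |A| where x solves Ax + b = y."""
    x = np.linalg.solve(A, y - b)
    return f_X(x) / abs(np.linalg.det(A))

def f_Y_gaussian(y, A, b):
    """Density of N(b, AA') at y, written out directly for comparison."""
    S = A @ A.T
    d = y - b
    quad = d @ np.linalg.solve(S, d)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(S)))

A = np.array([[2.0, 1.0],
              [0.5, 3.0]])               # nonsingular: the determinant is 5.5
b = np.array([1.0, -1.0])
y = np.array([0.3, 2.0])
print(f_Y_from_theorem(y, A, b), f_Y_gaussian(y, A, b))   # the two values agree
```

Solving Ax + b = y with `np.linalg.solve` avoids forming the inverse of A explicitly.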
Fig. B.22 A singular transformation $Y = (X_1, X_1)'$


When the matrix A is singular, the random vector $Y = AX + b$ takes values in
a set of dimension less than n. In that case, the vector Y does not have a density in
$\Re^n$. As a simple example of this situation, assume that $X_1$ and $X_2$ are independent
and uniformly distributed in [0, 1]. (See Fig. B.22.)

Let

$$Y = \begin{bmatrix} X_1 \\ X_1 \end{bmatrix} = AX \text{ with } A = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}.$$

Then Y has no density in $\Re^2$. Indeed, if it had one, one would find that, with

$$L = \{y \mid y_1 = y_2 \text{ and } 0 \le y_1 \le 1\},$$

$$P(Y \in L) = \int\!\!\int_L f_Y(y) \, dy = 0,$$

since L has measure 0 in $\Re^2$. But $P(Y \in L) = 1$.
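
The singular example can be visualized with a short simulation. This is only a sketch: the sample size and the seed are arbitrary assumptions. It confirms that A has determinant 0 and rank 1, and that every sample of Y lies on the line y1 = y2.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 0.0]])                    # maps (x1, x2) to (x1, x1)
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(2, 10000))    # columns are samples of (X1, X2)
Y = A @ X                                     # columns are samples of Y = AX

print(abs(np.linalg.det(A)) < 1e-12)          # True: A is singular
print(np.linalg.matrix_rank(A))               # 1: the image of A is a line
print(np.allclose(Y[0], Y[1]))                # True: all samples satisfy y1 = y2
```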


B.7.2 Nonlinear Transformations 


The case when Y = g(X) for a nonlinear function g(-) is slightly more tricky. Let 
us look at one example first. 


First Example 
Say that $X =_D U[0, 1]$ and $Y = X^2$, as shown in Fig. B.23.

Fig. B.23 The transformation $Y = X^2$ with $X =_D U[0, 1]$

As the figure shows, for $0 < \epsilon \ll 1$, one has $Y \in [y, y + \epsilon)$ if and only if
$X \in [x_1, x_1 + \delta)$ where

$$\delta = \frac{\epsilon}{g'(x_1)} = \frac{\epsilon}{2x_1} \text{ where } g(x_1) = x_1^2 = y.$$

Now, recalling that $o(\epsilon)$ designates a function of $\epsilon$ such that
$o(\epsilon)/\epsilon \to 0$ as $\epsilon \to 0$,

$$P(Y \in [y, y + \epsilon)) = f_Y(y)\epsilon + o(\epsilon)$$

and

$$P(X \in [x_1, x_1 + \delta)) = f_X(x_1)\delta + o(\delta).$$

Also, $o(\delta) = o(\epsilon)$. Hence,

$$f_Y(y)\epsilon + o(\epsilon) = f_X(x_1)\delta + o(\epsilon) = \frac{1}{g'(x_1)} f_X(x_1)\epsilon + o(\epsilon),$$

so that

$$f_Y(y) = \frac{1}{g'(x_1)} f_X(x_1) \text{ where } g(x_1) = y.$$

In this example, we see that

$$f_Y(y) = \frac{1}{2\sqrt{y}}$$

because $g'(x_1) = 2x_1 = 2\sqrt{y}$ and $f_X(x_1) = 1$.
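
One way to double-check this density is to differentiate the c.d.f. of Y numerically. For X uniform on [0, 1], P(X^2 <= y) = P(X <= sqrt(y)) = sqrt(y) when 0 <= y <= 1. The sketch below compares a central-difference derivative of that c.d.f. with 1/(2 sqrt(y)); the test points and the step h are arbitrary choices.

```python
import numpy as np

def F_Y(y):
    """c.d.f. of Y = X^2 for X uniform on [0, 1]: P(X <= sqrt(y)) = sqrt(y)."""
    return np.sqrt(np.clip(y, 0.0, 1.0))

def f_Y(y):
    """Density from the text: f_X(x1) / g'(x1) with x1 = sqrt(y), f_X = 1, g'(x) = 2x."""
    return 1.0 / (2.0 * np.sqrt(y))

y = np.array([0.1, 0.25, 0.5, 0.9])
h = 1e-6
numerical = (F_Y(y + h) - F_Y(y - h)) / (2 * h)   # central-difference derivative
print(np.max(np.abs(numerical - f_Y(y))))         # very small: the two agree
```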
Second Example 


We now look at a slightly more complex example. Assume that $Y = g(X) = X^2$
where X takes values in [−1, 1] and has p.d.f.

$$f_X(x) = \frac{3}{8}(1 + x)^2, \quad x \in [-1, 1].$$

See Fig. B.24.
Consider one value of $y \in (0, 1)$. Note that there are now two values of x, namely
$x_1 = \sqrt{y}$ and $x_2 = -\sqrt{y}$ such that $g(x) = y$. Thus,

$$P(Y \in (y, y + \epsilon)) = P(X \in (x_1, x_1 + \delta_1)) + P(X \in (x_2 - \delta_2, x_2)),$$

where

$$\delta_1 = \frac{\epsilon}{g'(x_1)} \text{ and } \delta_2 = \frac{\epsilon}{|g'(x_2)|}.$$

Hence,

$$f_Y(y)\epsilon + o(\epsilon) = \frac{\epsilon}{g'(x_1)} f_X(x_1) + \frac{\epsilon}{|g'(x_2)|} f_X(x_2) + o(\epsilon)$$

and we conclude that

$$f_Y(y) = \frac{1}{g'(x_1)} f_X(x_1) + \frac{1}{|g'(x_2)|} f_X(x_2).$$

For this specific example, we find

$$f_Y(y) = \frac{1}{2\sqrt{y}} \cdot \frac{3}{8}(1 + \sqrt{y})^2 + \frac{1}{2\sqrt{y}} \cdot \frac{3}{8}(1 - \sqrt{y})^2 = \frac{3}{8\sqrt{y}}(1 + y).$$

Fig. B.24 The transformation $Y = X^2$ with $X \in [-1, 1]$
Third Example 


Our next example is a general differentiable function g(·). From the second example,
we can see that if $Y = g(X)$, then

$$f_Y(y) = \sum_i \frac{1}{|g'(x_i)|} f_X(x_i), \tag{B.11}$$

where the sum is over all the $x_i$ such that $g(x_i) = y$.
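
Formula (B.11) translates directly into a few lines of code. The sketch below assumes, for illustration, the function g(x) = x^2 on [-1, 1] with the density f_X(x) = (3/8)(1 + x)^2. The helper `density_via_roots` and the way the roots are supplied are choices made for this example, not part of the text.

```python
import numpy as np

def density_via_roots(f_X, g_prime, roots, y):
    """Formula (B.11): sum of f_X(x_i) / |g'(x_i)| over the roots x_i of g(x) = y."""
    return sum(f_X(x) / abs(g_prime(x)) for x in roots(y))

# Assumed illustration: g(x) = x^2 on [-1, 1] with f_X(x) = (3/8)(1 + x)^2.
f_X = lambda x: 0.375 * (1.0 + x) ** 2 if -1.0 <= x <= 1.0 else 0.0
g_prime = lambda x: 2.0 * x
roots = lambda y: [np.sqrt(y), -np.sqrt(y)]   # the two solutions of x^2 = y, 0 < y <= 1

f_Y = lambda y: density_via_roots(f_X, g_prime, roots, y)
print(f_Y(0.25))   # 0.9375, which equals (3/8)(1 + y)/sqrt(y) at y = 0.25
```

The caller must supply all the roots of g(x) = y; a missing root silently produces a density that is too small.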
Fourth Example 


What about the multi-dimensional case? The key idea is that, locally, the
transformation from x to y looks linear. Observe that

$$g_i(x + dx) \approx g_i(x) + \sum_j \frac{\partial}{\partial x_j} g_i(x) \, dx_j, \quad \text{i.e.,} \quad g(x + dx) \approx g(x) + J(x) \, dx,$$

where $J(x)$ is the matrix defined by

$$J_{i,j}(x) = \frac{\partial}{\partial x_j} g_i(x).$$

This matrix is called the Jacobian of the function $g : \Re^n \to \Re^n$. Thus, locally,
the transformation looks like $Y = AX + b$ where $b = g(x)$ and $A = J(x)$.
Consequently, the density $f_X$ around x such that $g(x) = y$ gets transformed
as if the transformation were linear: it is stretched by the determinant of $J(x)$.
We therefore have the following theorem. 


Theorem B.14 (Density of Function of Random Variables) Assume that $Y = g(X)$
where X has density $f_X$ in $\Re^n$ and $g : \Re^n \to \Re^n$ is differentiable. Then

$$f_Y(y) = \sum_i \frac{1}{|J(x_i)|} f_X(x_i)$$

where the sum is over all the $x_i$ such that $g(x_i) = y$ and $|J(x_i)|$ is the absolute value
of the determinant of the Jacobian evaluated at $x_i$.


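Here is an example to illustrate this result. Assume that $X = (X_1, X_2)$ where the
$X_i$ are i.i.d. U[0, 1]. Consider the transformation

$$Y_1 = X_1^2 + X_2^2 \text{ and } Y_2 = 2X_1X_2.$$

Then

$$J(x) = \begin{bmatrix} 2x_1 & 2x_2 \\ 2x_2 & 2x_1 \end{bmatrix}.$$

Hence,

$$|J(x)| = 4|x_1^2 - x_2^2|.$$

There are two values of x that correspond to each value of y. These values are

$$x_1 = \frac{1}{2}\left[\sqrt{y_1 + y_2} + \sqrt{y_1 - y_2}\right] \text{ and } x_2 = \frac{1}{2}\left[\sqrt{y_1 + y_2} - \sqrt{y_1 - y_2}\right]$$

and

$$x_1 = \frac{1}{2}\left[\sqrt{y_1 + y_2} - \sqrt{y_1 - y_2}\right] \text{ and } x_2 = \frac{1}{2}\left[\sqrt{y_1 + y_2} + \sqrt{y_1 - y_2}\right].$$

For these values,

$$|J(x)| = 4\sqrt{y_1^2 - y_2^2}.$$

Hence, since $f_X(x) = 1$ at both of these values,

$$f_Y(y) = \frac{1}{2\sqrt{y_1^2 - y_2^2}}$$

for all possible values of y. 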
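
The example can be verified numerically at a given point. The sketch below applies Theorem B.14 with f_X = 1 on the unit square and compares the result with the closed form 1/(2 sqrt(y1^2 - y2^2)) that the calculation above yields. The function names and the test point x = (0.8, 0.3) are assumptions made for the illustration.

```python
import numpy as np

def g(x):
    """The transformation y1 = x1^2 + x2^2, y2 = 2 x1 x2."""
    return np.array([x[0] ** 2 + x[1] ** 2, 2.0 * x[0] * x[1]])

def jacobian(x):
    """Matrix of partial derivatives J[i, j] = d g_i / d x_j."""
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [2.0 * x[1], 2.0 * x[0]]])

def roots(y):
    """The two solutions x of g(x) = y with nonnegative components."""
    s, d = np.sqrt(y[0] + y[1]), np.sqrt(y[0] - y[1])
    return [np.array([(s + d) / 2, (s - d) / 2]),
            np.array([(s - d) / 2, (s + d) / 2])]

def f_Y(y):
    """Theorem B.14 with f_X = 1 on the unit square: sum of 1/|det J(x_i)|."""
    return sum(1.0 / abs(np.linalg.det(jacobian(x)))
               for x in roots(y) if np.all((0 <= x) & (x <= 1)))

y = g(np.array([0.8, 0.3]))        # a value of y that is certainly attainable
closed_form = 1.0 / (2.0 * np.sqrt(y[0] ** 2 - y[1] ** 2))
print(f_Y(y), closed_form)         # the two numbers agree
```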


B.8 References 


Mastering probability theory requires curiosity, intuition, and patience. Good books 
are very helpful. Personally, I enjoyed Pitman (1993). The home page of David 
Aldous (2018) is a source of witty and inspiring comments about probability. The 
textbooks Bertsekas and Tsitsiklis (2008), Grimmett and Stirzaker (2001), and 
Billingsley (2012) are very useful. The text Wong and Hajek (1985) provides a 
deeper discussion of the topics in this book. The books Gallager (2014) and Hajek 
(2017) are great resources and are highly recommended to complement this course. 
Wikipedia and YouTube are cool sources of information about everything, 
including probability. I like to joke, “Don’t take notes, it’s all on the web.” 


B.9 Problems 
Problem B.1 Assume that you have a collection of coins and that the probability that
coin n yields heads is $p_n$. Show that, as you keep flipping the coins, the flips yield a
finite number of heads if and only if $\sum_n p_n < \infty$.
Hint This is a direct consequence of the Borel-Cantelli Theorem and its converse.

Problem B.2 Indicate whether the following statements are true or false:

(a) Disjoint events are independent.
(b) The variance of a sum of random variables is always the sum of their variances.
(c) The expected value of a sum of random variables is the sum of their expected
values.

Problem B.3 Provide examples of events A, B, C such that

$$P[A|C] < P(A), \quad P[A|B] > P(A) \quad \text{and} \quad P[B|A] > P(B).$$

Problem B.4 Roll two balanced dice. Let A be the event “the sum of the faces is
less than or equal to 8.” Let B be the event “the face of the first die is larger than or
equal to 3.”

* What is the probability space $(\Omega, \mathcal{F}, P)$?
* Calculate $P[A|B]$ and $P[B|A]$.


Problem B.5 You flip a fair coin repeatedly, forever. 


* What is the probability that out of the first 1000 flips the number of heads is 
even? 

* What is the probability that the number of heads is always ahead of the number 
of tails in the first 4 flips? 


Problem B.6 Let X, Y be i.i.d. Exp(1), i.e., exponentially distributed with rate 1.
Derive the p.d.f. of Z = X + Y. 


Problem B.7 You pick four cards randomly from a perfectly shuffled 52-card deck. 
Assume that the four cards you got are all numbered between 2 and 10. For instance, 
you got a 2 of diamonds, a 10 of hearts, a 6 of clubs, and a 2 of spades. Write 
a MATLAB script to calculate the probability that the sum of the numbers on the 
black cards is exactly twice the sum of the numbers on the red cards. 


Problem B.8 Let $X =_D G(p)$, i.e., geometrically distributed with parameter p.
Calculate $E(X^2)$.


Problem B.9 Let X, Y be i.i.d. U[0, 1]. Calculate $E(\max\{X, Y\} - \min\{X, Y\})$.


Problem B.10 Let $X =_D P(\lambda)$ (i.e., Poisson distributed with mean $\lambda$). Find
P(X is even). 


Problem B.11 Consider $\Omega = [0, 1]$ with the uniform distribution. Let
$X(\omega) = 1\{a < \omega < b\}$ and $Y(\omega) = 1\{c < \omega < d\}$ for some $0 < a < b < 1$ and
$0 < c < d < 1$. Assume that X and Y are uncorrelated. Are they necessarily independent? 


Problem B.12 Let X and Y be i.i.d. U[−1, 1] and define Z = XY. Are X and Z
uncorrelated? Are they independent? 


Problem B.13 Let $X =_D U[-1, 3]$ and $Y = X^2$. Calculate $f_Y(\cdot)$.


Problem B.14 You are given a one meter long stick. You choose two points X and 
Y independently and uniformly along the stick and cut the stick at those two points. 
What is the probability that you can make a triangle with the three pieces? 


Problem B.15 Two friends go independently to a bar at times that are uniformly
distributed between 5:00 pm and 6:00 pm. They wait for ten minutes when they get
there. What is the probability that they meet? 


Problem B.16 Choose $V \ge 0$ so that $V^2 =_D \mathrm{Exp}(2)$. Now choose $\theta =_D U[0, 2\pi]$,
independent of V. Define $X = V\cos(\theta)$ and $Y = V\sin(\theta)$. Calculate $f_{X,Y}(x, y)$.


Problem B.17 Assume that Z and 1/Z are random variables with the same
probability distribution and such that $E(|Z|)$ is well-defined. Show that $E(|Z|) \ge 1$.


Problem B.18 Let $\{X_n, n \ge 1\}$ be i.i.d. with mean 0 and variance 1. Define
$Y_n = (X_1 + \cdots + X_n)/n$.


(a) Calculate $\mathrm{var}(Y_n)$.
(b) Show that $P(|Y_n| > \epsilon) \to 0$ as $n \to \infty$, for all $\epsilon > 0$.


Problem B.19 Let X, Y be i.i.d. U[0, 1] and $Z = A(X, Y)'$ where A is a given
2 × 2 matrix. What is the p.d.f. of Z? 


Problem B.20 Let $X =_D U[1, 7]$ and $Y = \ln(X) + 3\sqrt{X}$. Show that $E(Y) \le 7.4$.


Problem B.21 Pick two points X and Y independently and uniformly in $[0, 1]^2$.
Calculate $E(\|X - Y\|^2)$.


Problem B.22 Let (X,Y) be picked uniformly in the triangle with corners 
(—1, 0), (1, 0), (0, 1). Find cov(X, Y). 


Problem B.23 Let X be a random variable with mean 1 and variance 0.5. Show 
that 


EQX-EAX + X’) > 8.5. 


Problem B.24 Let X, Y, Z be i.i.d. and uniformly distributed in {−1, +1} (i.e.,
equally likely to be −1 or +1). Define $V_1 = XY$, $V_2 = YZ$, $V_3 = XZ$.


(a) Are $\{V_1, V_2, V_3\}$ pairwise independent? Prove.
(b) Are $\{V_1, V_2, V_3\}$ mutually independent? Prove.


Problem B.25 Let A and B be events with probabilities $P(A) = 3/4$ and $P(B) = 1/3$.
Show that $1/12 \le P(A \cap B) \le 1/3$, and give examples to show that both the upper
and lower bounds are tight. Find corresponding bounds for $P(A \cup B)$.


Problem B.26 A power system supplies electricity to a city from N plants. Each
power plant fails with probability p independently of the other plants. The city will
experience a blackout if fewer than k plants are supplying it, where 0 < k < N.
What is the probability of blackout? 

Fig. B.25 Reliability graph of a system

Fig. B.26 A circuit used as a simple timer. An external circuit detects when the
voltage V(t) drops below 1 V


Problem B.27 Figure B.25 is the reliability graph of a system. The links of the
graph represent components of the system. Each link i is working with probability
$p_i$ and defective with probability $1 - p_i$, independently of the other links. The system
is operational if the nodes S and T are connected. Thus, the system is built of two
redundant subsystems. Each subsystem consists of a number of components. 


(a) Calculate the probability that the system is operational. 

(b) Assume now the reliability graph is a binary tree with n levels and that the links
fail independently with probability 1 − p. What is the probability g(n) that there
is a working path from the root to a leaf? 

(c) Show that $g(n) \to 0$ as $n \to \infty$ if p < 0.5. Also, prove that $g(n) \to q > 0$ if
p > 0.5. What is the limit q? 


Problem B.28 Figure B.26 illustrates an RC-circuit used as a timer. Initially, the
capacitor is charged by the power supply to 5 V. At time t = 0, the switch is flipped
and the capacitor starts discharging through the resistor. An external circuit detects
the time $\tau$ when V(t) first drops below 1 V. 


(a) Calculate $\tau$ in terms of R and C. 

(b) Assume now that R and C are independent random variables that are uniformly
distributed in $[R_0(1 - \epsilon), R_0(1 + \epsilon)]$ and $[C_0(1 - \epsilon), C_0(1 + \epsilon)]$, respectively.
Calculate the variance of $\tau$.

(c) Let $\tau_0$ be the value of $\tau$ that corresponds to $R = R_0$ and $C = C_0$. Find an upper
bound on the probability that $|\tau - \tau_0| > \delta\tau_0$ for some small $\delta$.

Fig. B.27 Alice and Bob play the game “matching pennies”


Problem B.29 Alice and Bob play the game of matching pennies. In this game, 
they both choose the side of the penny to show. Alice wins if the two sides are 
different and Bob wins otherwise (Fig. B.27). 


(a) Assume that Alice chooses to show “head” with probability $p_A \in [0, 1]$.
Calculate the probability $p_B$ with which Bob should show “head” to maximize
his probability of winning. 

(b) From your calculations, find the best choices of $p_A$ and $p_B$ for Alice and Bob.
Argue that those choices are such that Alice cannot improve her chance of
winning by modifying $p_A$ and similarly for Bob. A solution with that property
is called a Nash equilibrium. 


Problem B.30 You find two old batteries in a drawer. They produce the voltages X 
and Y. Assume that X and Y are i.i.d. and uniformly distributed in [0, 1.5]. 


(a) What is the probability that if you put them in series they produce a voltage 
larger than 2? 

(b) What is the probability that at least one of the two batteries has a voltage that 
exceeds 1V? 

(c) What is the probability that both batteries have a voltage that exceeds 1 V? 

(d) You find more similar batteries in that drawer. You test them one by one until 
you find one whose voltage exceeds 1.2 V. What is the expected number of 
batteries that you have to test? 

(e) You pick three batteries. What is the probability that at least two of them have 
voltages that add up to more than 2? (Fig. B.28). 


Problem B.31 You want to sell your old iPhone 4S. Two friends, Alice and Bob, 
are interested. You know that they value the phone at X and Y , respectively, where X 
and Y are i.i.d. U[50, 150]. You propose the following auction. You ask for a price 
R. If Alice bids A and Bob B, then the phone goes to the highest bidder, provided 
that it is larger than R, and the highest bidder pays the maximum of the second bid 
and R. Thus, if A < R < B, then Bob gets the phone and pays R. If R < A < B,
then Bob gets the phone and pays A (Fig. B.29). 


(a) What is the expected payment that you get for the phone if A = X and B = Y? 
(b) Find the value of R that maximizes this expected payment. 
(c) The surplus of Alice is X − P if she gets the phone and pays P for it; it is
zero if she does not get the phone. Bob’s surplus is defined similarly. Show that
Alice maximizes her expected surplus by bidding A = X and similarly for Bob.
We say that this auction is incentive compatible. Also, this auction is revenue
maximizing. 

Fig. B.28 Batteries

Fig. B.29 Alice and Bob have private valuations X and Y for the phone and they bid
A and B, respectively


Problem B.32 Recall that the trace tr(S) of a square matrix S is the sum of its
diagonal elements. Let A be an m × n matrix and B an n × m matrix. Show that
tr(AB) = tr(BA). 


Problem B.33 Let $\Sigma$ be the covariance of some random vector X. Show that
$a'\Sigma a \ge 0$ for all real vectors a.

Fig. B.30 What size solar panels should you buy for your house?


Problem B.34 You want to buy solar panels for your house. Panels that deliver a
maximum power K cost $\alpha K$ per unit of time, after amortizing the cost over the
lifetime of the panels. Assume that the actual power Z that such panels deliver is
U[0, K] (Fig. B.30). 

The power X that you need is U[0, A] and we assume it is independent of the
power that the solar panels deliver. If you buy panels with a maximum power K,
your cost per unit time is 

$$\alpha K + \beta \max\{0, X - Z\},$$

where the last term is the amount of power that you have to buy from the grid. Find
the maximum power K of the panels you should buy to minimize your expected
cost per unit time. 